Application of STOJANCIC, et al

PRE-COMPUTATION AND DUAL-PASS MODULAR ARITHMETIC OPERATION 
APPROACH TO IMPLEMENT ENCRYPTION PROTOCOLS EFFICIENTLY IN 
ELECTRONIC INTEGRATED CIRCUITS 

FIELD OF THE INVENTION 

The present invention relates to encryption of electronic messages, and more 
specifically, to a method and apparatus for using pre-computation and dual-pass modular 
arithmetic operations to implement encryption protocols efficiently in electronic hardware 
such as integrated circuits. 

BACKGROUND OF THE INVENTION 

Some encryption protocols for electronic messages are based on modular 
mathematics, such as the Diffie-Hellman protocol and the Rivest-Shamir-Adleman (RSA) 
protocol. As an example of modular math, consider the expression X mod M = Z, where X is 
the operand, M is the modulus, and Z is the remainder. The value for the remainder, Z, is the 
same as the remainder from integer division of the operand, X, by the modulus, M. As a 
numerical example, consider 27 mod 10 = 7. The remainder, 7, is what is left after the 
operand 27 is divided by the modulus 10. The remainder 7 may also be referred to as the 
modular reduction of 27 modulo 10. 

Modular arithmetic includes a variety of modular operations, including but not 
limited to, modular addition, modular subtraction, modular multiplication, modular division, 
and modular exponentiation. For example, $XY \bmod M = Z$ is an example of modular
multiplication in which Z is the modular reduction of the result of multiplying X by Y. As
another example, $X^Y \bmod M = Z$ is an example of modular exponentiation in which Z is the
modular reduction of the result of raising X to the power Y.
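
As a brief illustration of the three operations just described, the following sketch uses
Python integers. The numerical values other than 27 mod 10 are illustrative assumptions and
do not come from the specification.

```python
# Modular reduction, multiplication, and exponentiation with Python integers.
X, Y, M = 27, 13, 10            # Y = 13 is an illustrative value

reduction = X % M               # X mod M (the 27 mod 10 example from the text)
product = (X * Y) % M           # XY mod M
power = pow(X, Y, M)            # X^Y mod M, reduced at each step internally
```

The three-argument form of `pow` avoids forming the full value of X raised to Y, which matters
for the large operands discussed below.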

Many encryption protocols rely on numerical "keys" that are used to encrypt and 
decrypt messages. Some protocols use private keys that are kept secret except from the
parties exchanging the encrypted messages, while other protocols use a combination of
private and public keys in which the public keys are freely distributed to the public at large 
while the private keys are kept secret. 

Generally, the longer the key size used in a particular encryption protocol, the better 
the level of security that can be achieved. However, some encryption protocols involve 
modular operations, such as modular multiplication and modular exponentiation, which are 
computationally intensive, particularly for large operands that are associated with using 
longer keys. For example, for 2048-bit keys, a typical hardware implementation of the RSA
protocol involves millions of logic gates and very high clock frequencies, which makes such 
hardware implementations impractical for widespread use. Therefore, hardware 
implementations of the RSA and other encryption protocols are generally limited to shorter 
keys to reduce computational requirements, but such shorter keys provide less security than 
longer keys. 

One approach for performing modular operations for large operands is based on 
Montgomery's method, which is a modular operation algorithm where one modular reduction 
is performed at each iteration of the modular operation by a shift instead of a division. For 
example, given two operands, X and Y, and a modulus, M, the modular multiplication 
operation for computing the result, Z, of X Y mod M based on Montgomery's method may 
be found by evaluating the expressions: 

$$
\begin{aligned}
S &= XY \\
Q &= M' S \bmod W \\
Z &= (S + QM)/W
\end{aligned}
$$

where

$$
W = 2^N, \qquad W > 4M,
$$

and M' is chosen such that $M' M = -1 \bmod W$.

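
The expressions above can be sketched in Python as follows. This is a minimal illustration
under stated assumptions, not the hardware implementation: the function names are
hypothetical, the modulus is assumed to be odd so that M' exists modulo a power of two, and
`pow(M, -1, W)` requires Python 3.8 or later.

```python
def montgomery_setup(M):
    """Return (N, M_prime) with W = 2**N > 4*M and M_prime * M = -1 (mod W)."""
    if M % 2 == 0:
        raise ValueError("M must be odd to be invertible modulo a power of two")
    N = (4 * M).bit_length()                  # smallest N with 2**N > 4*M
    W = 1 << N
    M_prime = (-pow(M, -1, W)) % W            # negative multiplicative inverse of M
    return N, M_prime


def montgomery_multiply(X, Y, M, N, M_prime):
    """Return Z with Z = X*Y/W (mod M) for W = 2**N, using a mask and a shift."""
    mask = (1 << N) - 1
    S = X * Y
    Q = (M_prime * S) & mask                  # "mod W" becomes a bit mask
    return (S + Q * M) >> N                   # "/ W" becomes a right shift


M = 101                                       # illustrative small modulus
N, M_prime = montgomery_setup(M)
W = 1 << N
Z = montgomery_multiply(55, 77, M, N, M_prime)
```

Note that Z is congruent to XY divided by W modulo M, not to XY itself, which is the scaling
discussed next.
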

Using Montgomery's method in a hardware encryption device can reduce the
complexity of the device. However, a drawback of this approach is that the result, Z, is not
the exact result desired for X Y mod M; rather, the result, Z, is a scaled result. In order to
efficiently scale the result (i.e., reduce the scaling operation to a bit shift in binary numerical
calculations), W must be chosen to be a power of two (i.e., N must be an integer), which
limits the possible values of W for a given modulus, M. In addition, the approach presented
above requires three sequential multiplication operations with very large operands (X and Y,
represented in binary form, may have more than 1024 bits each, producing a multiplication
result with 2048 or more bits), which are very time-consuming operations on a general-purpose
digital computer system.

Based on the foregoing, it is desirable to provide improved techniques for encryption. 
It is also desirable to have improved techniques for implementations of encryption protocols 
that achieve acceptable performance for longer keys, in a hardware device that has a practical 
gate structure. 

SUMMARY OF THE INVENTION

The foregoing needs, and other needs and objects that will become apparent from the
following description, are achieved in the present invention, which comprises, in one aspect,
a method for encryption and decryption of electronic messages based on an encryption 
protocol using a pre-computation and dual-pass modular operation approach to implement 
the encryption protocol efficiently in electronic integrated circuits. A message that is 
encrypted according to the encryption protocol is received. At least one part of another 
message is generated based on the received message and a modular operation. The modular 
operation is based on two applications of Montgomery's method with a constant chosen not 
to be a power of two number, two operands, and a modulus. The second electronic message 
is created by pre-computing another constant based on the modulus. An intermediate result 
is determined and stored in memory based on a first application of Montgomery's method for 
the modular operation, one operand, and the pre-computed constant. A final result is 
determined based on a second application of Montgomery's method for the modular 
operation, the intermediate result, and the other operand. 

According to other aspects, the encryption protocol may include, but is not limited to, 
Rivest-Shamir-Adleman (RSA), Diffie-Hellman, and digital signature algorithm (DSA) 
protocols. The modular operations include, but are not limited to, modular multiplication and 
modular exponentiation. The modular arithmetic may be performed based on a residue 
number system (RNS) using RNS representations in two bases, with the RNS bases chosen 
such that one base extends the other to a total of a larger base RNS system, and using 
conversions between the primary and extended RNS bases. For modular multiplication, two 
register files may be used, and for modular exponentiation, four register files may be used.
An array of multiplier circuits and an array of modular reduction circuits may be used for 
both modular multiplication and modular exponentiation. 



50325-0550 (Seq. No. 4166) 




According to other aspects, additional methods, apparatuses, and computer-readable 
media that implement the approaches above are described. 

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is depicted by way of example, and not by way of limitation, in 
the figures of the accompanying drawings and in which like reference numerals refer to 
similar elements and in which: 

FIG. 1 is a flow diagram that depicts an approach for using a pre-computed constant 
and dual-pass modular operations, according to an embodiment; 

FIG. 2 is a flow diagram that depicts an approach for using a pre-computed constant 
and dual-pass modular operations with RNS representations, according to an embodiment; 

FIG. 3A is a block diagram depicting an example of part of an apparatus for
performing modular multiplication, according to an embodiment; 

FIG. 3B is a block diagram depicting an example of part of an apparatus for 
performing modular exponentiation, according to an embodiment; 

FIG. 4 is a flow diagram that depicts the inbound and outbound IPSec processing 
procedures; 

FIG. 5 is a diagram that depicts an overview of IKE Phases 1 and 2; 

FIG. 6 is a block diagram of an IKEON processor, according to an embodiment; 

FIG. 7 is a diagram that depicts how a modular ALU is attached to the processor's 
execution unit of the Reduced-Toaster Micro Controller (RMC) core engine; 

FIG. 8 is a flow diagram that depicts the control flow for modular multiplication in 
RNS, according to an embodiment; 

FIG. 9 is a flow diagram that depicts the data flow for modular multiplication in RNS, 
according to an embodiment; 

FIG. 10 is a block diagram that depicts a modular multiplier circuit, according to an 
embodiment; 

FIG. 11 is a block diagram that depicts a modular reduction circuit, according to an
embodiment; 

FIGS. 12A-12B are a flow diagram that depicts the control flow for modular 
exponentiation in RNS, according to an embodiment; 

FIG. 13 is a flow diagram that depicts the data flow for modular exponentiation in 
RNS, according to an embodiment; 

FIG. 14 is a diagram of a single raw architecture for RNS base conversion, according 
to an embodiment; 

FIG. 15 is a diagram that depicts an example of a counter/compressor structure; 

FIG. 16 is a diagram that depicts another example of a counter/compressor structure; 

FIG. 17 is a diagram that depicts an array structure that uses only 4:2 compressors, 
according to an embodiment; 

FIG. 18 is a diagram that depicts an implementation of RNS to binary conversion, 
according to an embodiment; 

FIG. 19 is a diagram of an RNS conversion pipeline that provides input to an array of 
4:2 compressors, according to an embodiment; 

FIG. 20 is a diagram of an array of 6 levels of 4:2 compressors, according to an 
embodiment; 

FIG. 21 is a diagram of eight planes of pipelined hardware for simultaneous
execution, according to an embodiment;

FIG. 22 is a diagram of vector dw/v computations, according to an embodiment; and

FIG. 23 is a block diagram that depicts a computer system upon which embodiments
of the invention may be implemented.

DETAILED DESCRIPTION OF THE INVENTION

A method and apparatus for using a pre-computation and dual-pass modular operation
approach to implement encryption protocols efficiently in electronic integrated circuits are
described. In the following description, for the purposes of explanation, numerous specific
details are set forth in order to provide a thorough understanding of the present invention. It 
will be apparent, however, to one skilled in the art that the present invention may be 
practiced without these specific details. In other instances, well-known structures and 
devices are depicted in block diagram form in order to avoid unnecessarily obscuring the 
present invention. 

In the following description, the various functions shall be discussed under topic 
headings that appear in the following order: 
I. FUNCTIONAL OVERVIEW

A. Example Method for Modular Multiplication

B. Example Method for Modular Exponentiation

C. Residual Number System Implementation

II. STRUCTURAL OVERVIEW

A. Example Apparatus for Modular Multiplication

B. Example Apparatus for Modular Exponentiation

III. ACRONYMS

IV. REVIEW OF IPSEC AND IKE PROCESSING

A. IPSec Packet Processing Procedures

B. Introduction to IKE Processing

C. IKE Phase 1 Overview

D. IKE Phase 2 Overview

V. RSA PROTOCOL

A. RSA Key Generation

B. RSA Signature Generation and Verification for Authentication

C. RSA Encryption/Decryption Introduction

D. RSA Encryption/Decryption Using Single-Radix Conversion

E. RSA Encryption/Decryption Using Mixed-Radix Conversion

VI. DIGITAL SIGNATURE ALGORITHM (DSA) PROTOCOL

A. DSA Key Generation

B. DSA Signature Generation and Verification

VII. DIFFIE-HELLMAN PROTOCOL

VIII. IKEON ASIC ARCHITECTURE

A. IKEON High Level Block Diagram

B. IKEON Modular Arithmetic Processor

C. Modular Arithmetic Processor Instructions

IX. MODULAR ALU ARCHITECTURE

A. Preliminary Definitions

B. Dual-Pass Montgomery Multiplication with Precomputation

C. General Montgomery Exponentiation

D. Dual-Pass Montgomery Exponentiation with Precomputation

X. COMPUTATIONS IN THE RESIDUE NUMBER SYSTEM (RNS)

A. Introduction to RNS

B. Modular Multiplication in RNS

C. Modular Exponentiation in RNS

D. RNS Base Extension and Conversion from One RNS Base to Another RNS Base

XI. MODULAR REDUCTION FOR RNS CONVERSIONS

XII. MODULAR REDUCTION FOR MODULAR EXPONENTIATION LOOP

XIII. FINAL BINARY RESULT COMPUTATION

XIV. ADDITIONAL ARCHITECTURAL CONSIDERATIONS 

XV. HARDWARE OVERVIEW 

XVI. EXTENSIONS AND ALTERNATIVES 

I. FUNCTIONAL OVERVIEW
Techniques are provided for a method of using pre-computation and dual-pass
modular operations to implement encryption protocols efficiently in electronic hardware.
According to one embodiment, a modular operation is performed using a pre-computed
constant and a dual-pass implementation based on Montgomery's method with another
constant that is selected so as not to be a power of two. The pre-computed
constant provides for the correct final result at the end of the second pass instead of a scaled 
final result. The first pass is based on the first operand and the pre-computed constant and 
provides an intermediate result. The second pass is based on the intermediate result and the 
second operand and provides the final result. 

A. Example Method for Modular Multiplication 
FIG. 1 is a flow diagram that depicts an approach for using a pre-computed constant 
and dual-pass modular operations, according to an embodiment. Assume that the modular 
operation is a modular multiplication of the two operands, X and Y, based on a modulus, M, 
which can be expressed as X Y mod M. In block 110, the two operands and the modulus are
identified. For example, the operands X and Y may be determined according to a decryption 
step of the encryption protocol being used to decrypt an encrypted message, and the modulus 
was selected for the message prior to generating the encrypted message. 

In block 120, a constant, W, is selected and another constant, R, is pre-computed. For
example, the constant, W, is chosen such that W > 4M. The pre-computed constant, R, is
determined by the expression $R = W^2 \bmod M$. Other constant values that are required in
subsequent steps may be determined prior to the first pass of the dual passes. For example, a
negative multiplicative inverse of the modulus M, denoted by M', may be selected such that
$M' M = -1 \bmod W$.

In block 130, the first pass of the dual-pass approach is performed to determine the
intermediate result based on the first operand, the pre-computed constant, and
Montgomery's method. For example, if the modular operation is a modular multiplication,
the intermediate result, S, is determined using Montgomery's method according to the
following expressions:

$$
\begin{aligned}
Z &= XR \\
U &= Z M' \bmod W \\
S &= (Z + UM)/W
\end{aligned}
$$

In block 140, the second pass of the dual-pass approach is performed to determine the
final result based on the intermediate result, the second operand, the selected
constant, and Montgomery's method. For example, if the modular operation is a modular
multiplication, the final result, F, is determined using Montgomery's method according to the
following expressions:

$$
\begin{aligned}
Z &= YS \\
U &= Z M' \bmod W \\
F &= (Z + UM)/W
\end{aligned}
$$

The final result, F, is the desired result from the modular operation, such as 
X Y mod M in this example. 
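
The reason the second pass yields an unscaled result can be sketched with a short
derivation. It assumes, as a property of Montgomery's method that is not spelled out in the
text above, that W has a multiplicative inverse modulo M, written here as $W^{-1}$, and that
each pass returns the product of its two inputs multiplied by $W^{-1}$ modulo M.

```latex
% Each pass divides exactly by W because Z + UM is a multiple of W:
%   Z + UM \equiv Z + Z M' M \equiv Z - Z \equiv 0 \pmod{W}.
\begin{aligned}
R &\equiv W^{2} \pmod{M} \\
S &\equiv X R\, W^{-1} \equiv X W \pmod{M}
   && \text{(first pass)} \\
F &\equiv Y S\, W^{-1} \equiv X Y W\, W^{-1} \equiv X Y \pmod{M}
   && \text{(second pass)}
\end{aligned}
```

The factor of W introduced by the pre-computed constant R in the first pass is thus cancelled
by the division by W in the second pass.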

The application of Montgomery's method may be expressed as Mont(A,B,C) that is 
defined to include the following expressions: 

$$
\begin{aligned}
D &= AB \\
E &= D M' \bmod W \\
C &= (D + EM)/W
\end{aligned}
$$

Based on the definition above for Mont(A,B,C), the dual-pass approach for modular 
multiplication may be expressed as: 
Mont(X,R,S) 
Mont(S,Y,F).
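
The two Mont applications can be sketched in Python with a constant W that is not a power of
two. This is an illustrative sketch under stated assumptions: the function names are
hypothetical, W is assumed to be relatively prime to M so that M' exists, the final `% M`
reduction is an added step that brings the result into the range 0 to M-1, and
`pow(M, -1, W)` requires Python 3.8 or later.

```python
from math import gcd


def mont(A, B, M, W, M_prime):
    """One application of Montgomery's method: returns C with C = A*B/W (mod M)."""
    D = A * B
    E = (D * M_prime) % W
    C, remainder = divmod(D + E * M, W)
    assert remainder == 0                 # D + E*M is a multiple of W by construction
    return C


def dual_pass_multiply(X, Y, M, W):
    """Compute X*Y mod M with two Mont passes and the pre-computed R = W^2 mod M."""
    if W <= 4 * M or gcd(W, M) != 1:
        raise ValueError("need W > 4*M and gcd(W, M) == 1")
    R = (W * W) % M                       # pre-computed constant
    M_prime = (-pow(M, -1, W)) % W        # M' with M' * M = -1 (mod W)
    S = mont(X, R, M, W, M_prime)         # first pass:  Mont(X, R, S)
    F = mont(S, Y, M, W, M_prime)         # second pass: Mont(S, Y, F)
    return F % M                          # added final reduction (assumption)


# W = 3 * 5 * 7 * 11 = 1155 is an illustrative non-power-of-two constant.
F = dual_pass_multiply(55, 77, 101, 1155)
```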

B. Example Method for Modular Exponentiation
As another example, assume that the modular operation is a modular exponentiation
such as $X^Y \bmod M$ for a modulus M. The two operands and the modulus are identified, as
discussed above with regard to block 110, and the constants W, R, and M' are determined as 
discussed above with respect to block 120. 

For block 130, the intermediate result, S, of the first pass of the dual-pass approach
for modular exponentiation is determined using Montgomery's method according to
the following expressions:

$$
\begin{aligned}
Z &= XR \\
U &= Z M' \bmod W \\
S &= (Z + UM)/W
\end{aligned}
$$

For block 140, the final result, F, of the second pass of the dual-pass approach
for modular exponentiation is determined using Montgomery's method according to
the following expressions, wherein Y is an n-digit binary number and Y_i denotes the i-th
digit of Y:

F = 1
for i = 1 to n
    if Y_i = 1 then
        Z = SF
        U = Z M' mod W
        F = (Z + UM)/W
    end if
    Z = SS
    U = Z M' mod W
    S = (Z + UM)/W
end loop

The final result, F, is the desired result from $X^Y \bmod M$.

Using the Mont(A,B,C) nomenclature above, the dual-pass approach for modular
exponentiation may be expressed as shown below, using binary-H and a bit indexing approach
in which the binary value 1011 (e.g., decimal 11) has "i" values of 0123:

Mont(X,R,S)
F = S
for i = 1 to n
    Mont(F,F,F)
    if Y_i = 1, then Mont(S,F,F)
end loop

While the expressions above are based on using binary-H and a bit indexing approach
in which the binary value 1011 (e.g., decimal 11) has "i" values of 0123, other
implementations may employ other approaches. For example, an implementation may use
binary-L and a bit indexing approach in which the binary value 1011 has "i" values of 4321,
and the dual-pass approach for modular exponentiation may be expressed as:

F = 1
Mont(X,R,S)
for i = 1 to n
    if Y_i = 1, then Mont(S,F,F)
    Mont(S,S,S)
end loop
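
The binary-L form above, which processes the least significant bit of Y first, can be
sketched in Python as follows. The function names are hypothetical, W is assumed to be
relatively prime to M, and the final `% M` reduction is an added step; `pow(M, -1, W)`
requires Python 3.8 or later.

```python
def mont(A, B, M, W, M_prime):
    """One application of Montgomery's method: returns C with C = A*B/W (mod M)."""
    D = A * B
    E = (D * M_prime) % W
    C, remainder = divmod(D + E * M, W)
    assert remainder == 0                 # exact division by W
    return C


def dual_pass_exponentiate(X, Y, M, W):
    """Compute X**Y mod M using the binary-L loop with Mont operations only."""
    R = (W * W) % M                       # pre-computed constant
    M_prime = (-pow(M, -1, W)) % W        # M' with M' * M = -1 (mod W)
    S = mont(X, R, M, W, M_prime)         # first pass: S = X*W (mod M)
    F = 1
    while Y > 0:                          # second pass, least significant bit first
        if Y & 1:
            F = mont(S, F, M, W, M_prime) # the factor W in S cancels the division by W
        S = mont(S, S, M, W, M_prime)     # S keeps exactly one factor of W
        Y >>= 1
    return F % M                          # added final reduction (assumption)


# Illustrative values; W = 1155 is not a power of two and exceeds 4*M.
F = dual_pass_exponentiate(7, 11, 101, 1155)
```

In this form F never carries the scaling factor W, because every multiplication into F pairs
one scaled operand, S, with one unscaled operand, F.
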

C. Residual Number System Implementation
According to another embodiment, the modular operations are performed using the 
residue number system (RNS) wherein the operands, constants, results, and other quantities 
are represented in at least one RNS base and frequently both RNS bases of a set of two RNS 
bases. Mathematical operations are performed based on the RNS representations, and 
conversions from one RNS base to the other RNS base are made between specified steps in 
the computations for a modular operation. 

To create an RNS representation, an RNS base, V, is generated. The RNS base, V, is
a set of numbers, $v_i$ for i = 1 to n, that satisfy the expression:

$$\gcd(v_i, v_j) = 1 \quad \text{for } i \neq j$$

where "gcd" denotes the greatest common divisor. The set of integers $v_i$ forms the
base of the RNS. Because the greatest common divisor for any pair of different elements
$v_i$, $v_j$ is 1, the integers $v_i$ in the RNS base V are referred to as being relatively
prime to one another. Given the RNS base, V, the RNS representation of a value, X, is:

$$X = (x_1, x_2, \ldots, x_{n-1}, x_n)$$

where

$$x_i = X \bmod v_i$$

The values in the RNS representation of X, $x_i$, are called the residues of X. Any
integer X may be represented by such an RNS representation provided that X < V. If an
integer number larger than V needs to be represented by its residues, the RNS base V can be
extended with an RNS base W, provided that all base elements, in both V and W, are
relatively prime. Thus a number X, VW > X > V, can be represented in extended RNS base
W. This property of the RNS representation will be exploited in the approach described
later. An RNS representation in one base may be converted into an RNS representation in
another base as described and used later herein.
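
The RNS representation and the effect of extending the base can be sketched in Python. The
small bases below are illustrative assumptions, the helper names are hypothetical, and the
range of a base is taken here to be the product of its elements; `math.prod` requires
Python 3.8 or later.

```python
from itertools import combinations
from math import gcd, prod


def is_valid_base(base):
    """True if every pair of base elements is relatively prime."""
    return all(gcd(a, b) == 1 for a, b in combinations(base, 2))


def to_rns(X, base):
    """Residues x_i = X mod v_i for each element v_i of the base."""
    return tuple(X % v for v in base)


V_base = (7, 11, 13)                 # illustrative base, range 7*11*13 = 1001
W_base = (15, 17)                    # illustrative extension base
extended = V_base + W_base           # range 1001 * 255 = 255255

small = to_rns(500, V_base)          # 500 < 1001, so this is unambiguous
# 5000 exceeds the range of V_base and collides with 5000 mod 1001 = 996 ...
collides = to_rns(5000, V_base) == to_rns(996, V_base)
# ... but the extended base tells the two values apart.
distinct = to_rns(5000, extended) != to_rns(996, extended)
```
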

FIG. 2 is a flow diagram that depicts an approach for using a pre-computed constant
and dual-pass modular operations with RNS representations, according to an embodiment. In 
block 210, a constant, W, is selected based on the modulus, M. For example, W is chosen 
such that W > 4 M. 

In block 220, another constant, R, is pre-computed. For example, the pre-computed
constant, R, is determined by the expression $R = W^2 \bmod M$. Other constant values that are
required in subsequent steps may be determined prior to the first pass of the dual passes. For
example, a "negative" multiplicative inverse of the modulus M, denoted by M', may be
selected such that $M' M = -1 \bmod W$.

In block 230, operands for the modular operation are identified. For example, if the 
modular operation is a modular multiplication of the form X Y mod M, the X and Y are 
identified as the operands for the modular multiplication operations. 

In block 240, two RNS bases are selected and RNS representations are generated in 
one or both RNS bases for the selected constant, pre-computed constant, modulus, and 
operands. An example of an RNS base is the RNS base V shown above. An example of an 
RNS representation is the RNS representation for the operand X in the RNS base V as shown 
above. RNS representations of other quantities may be determined, in addition to those 
listed, such as for the "negative" multiplicative inverse of the modulus, M'.

In block 250, the first pass of the dual-pass approach is performed in RNS arithmetic 
to determine an intermediate result, S, expressed in an RNS representation, based on the RNS 
representations of the first operand, X, the selected constant, W, and the pre-computed 
constant, R, and also based on Montgomery's method. For example, as discussed later 
herein, an RNS form of the expressions discussed above with respect to block 130 for 
modular multiplication and modular exponentiation may be used. In addition to the RNS 
form of such expressions, additional conversion steps are included to convert values from 
one RNS base to another RNS base between selected steps.

In block 260, the second pass of the dual-pass approach is performed in RNS
arithmetic to determine a final result, expressed in an RNS representation, based on the 
second operand, Y, and on the RNS representations of the intermediate result, S, and the 
selected constant, W, and also based on Montgomery's method. For example, as discussed 
later herein, an RNS form of the expressions discussed above with respect to block 140 for 
modular multiplication and modular exponentiation may be used. In addition to the RNS 
form of such expressions, additional conversion steps are included to convert values from 
one RNS base to another RNS base between selected steps. 

In block 270, the RNS representation for the final result is converted to a binary 
representation. 
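
One standard way to convert an RNS representation back to an ordinary integer is the
Chinese Remainder Theorem (CRT). The sketch below uses that textbook technique as a
stand-in; it is not necessarily the conversion used by the embodiments described later
herein. The function name and the small base are illustrative assumptions, and
`pow(a, -1, m)` requires Python 3.8 or later.

```python
from math import prod


def from_rns(residues, base):
    """Recover X, 0 <= X < product of the base, from its residues via the CRT."""
    total = prod(base)
    X = 0
    for x_i, v_i in zip(residues, base):
        V_i = total // v_i                    # product of all other base elements
        X += x_i * V_i * pow(V_i, -1, v_i)    # V_i is invertible modulo v_i
    return X % total


base = (7, 11, 13, 15, 17)                    # illustrative base, range 255255
a, b = 321, 123
# Multiplication is carried out independently on each residue.
product_residues = [((a % v) * (b % v)) % v for v in base]
recovered = from_rns(product_residues, base)
```

Because a times b is smaller than the range of the base, the recovered value equals the
ordinary integer product.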

According to other aspects, the residues in each RNS base are seventeen-bit residues, 
there are 64 residues in each RNS base, and the residues in each base are selected from a 
range of 2 to 2 . 

II. STRUCTURAL OVERVIEW
Techniques are provided for an apparatus that uses pre-computation and dual-pass
modular operations to implement encryption protocols efficiently in electronic hardware.
According to one embodiment, an apparatus for performing both modular multiplication and
modular exponentiation includes an array of modular reduction circuits and a set of register
files. For example, if the apparatus performs modular multiplication, the apparatus may
include an array of multiplier circuits, an array of modular reduction circuits, and two register 
files. As another example, if the apparatus performs modular exponentiation, the apparatus 
may include an array of multiplier circuits, an array of modular reduction circuits, and four 
register files. 

A. Example Apparatus for Modular Multiplication
FIG. 3A is a block diagram depicting an example of part of an apparatus 310 for
performing modular multiplication, according to an embodiment. The example of FIG. 3A is
based on having 64 17-bit residues with RNS representations, although a different number of 
residues of different lengths may be used. Apparatus 310 includes a multiplier circuit array 
320, a modular reduction circuit array 330, and register files 340, 350. 

Multiplier circuit array 320 is depicted in FIG. 3A as an array of 64 17 x 17 multiplier
circuits for performing multiplication operations. Multiplier circuit array 320 may be
implemented using an array of 4:2 compressors organized into three levels such that the
compression stage can be executed in one clock cycle. Multiplier circuit array 320 may also be
implemented with a final adder module, a set of 64 34-to-17-bit modular reduction circuits,
such as modular reduction circuit array 330 described below, and a register file, such as
register file 340 or register file 350 described below, which can be executed in one clock
cycle. Therefore, the multiplication operations performed by multiplier circuit array 320 may
be completed in two clock cycles.

Modular reduction circuit array 330 is depicted in FIG. 3A as an array of 64 17-bit modular
reduction circuits. Modular reduction circuit array 330 receives 34-bit values as input, such
as from the final adder module of multiplier circuit array 320 discussed above, and
produces 17-bit values as output to the appropriate register file.

Register file 340 is depicted in FIG. 3A as having 64 17-bit Z-registers, and register
file 344 is depicted in FIG. 3A as having 64 17-bit U-registers. Thus, one Z-register and one
U-register are provided for each of the 64 residues in the RNS base. As described later 
herein, the data flow for apparatus 310 is arranged such that the same set of Z-registers and 
U-registers are used for each of the two passes represented by blocks 130 and 140. 
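
The data widths described for FIG. 3A can be modeled in software as follows. This is only a
behavioral sketch: the 64 moduli are hypothetical (the first 64 primes above 2^16, chosen so
that they are 17 bits wide and relatively prime), and the function names are illustrative.
It does not model compressors, adders, or clock cycles.

```python
def first_primes_above(start, count):
    """Return the first `count` primes greater than `start` (trial division)."""
    primes = []
    n = start | 1                             # first odd candidate
    while len(primes) < count:
        if all(n % p for p in range(3, int(n ** 0.5) + 1, 2)):
            primes.append(n)
        n += 2
    return primes


MODULI = first_primes_above(1 << 16, 64)      # 64 hypothetical 17-bit moduli


def multiply_lanes(a, b):
    """Model of the multiplier array: 64 independent 17 x 17-bit products."""
    return [x * y for x, y in zip(a, b)]


def reduce_lanes(products):
    """Model of the reduction array: each product of up to 34 bits becomes a 17-bit residue."""
    return [p % m for p, m in zip(products, MODULI)]


X, Y = 12345678901234567890, 9876543210       # illustrative operands
x_res = [X % m for m in MODULI]
y_res = [Y % m for m in MODULI]
products = multiply_lanes(x_res, y_res)
z_res = reduce_lanes(products)
```
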

B. Example Apparatus for Modular Exponentiation
FIG. 3B is a block diagram depicting an example of part of an apparatus 350 for 
performing modular exponentiation, according to an embodiment. The example of FIG. 3B 
is based on having 64 17-bit residues, although a different number of residues of different 
lengths may be used. Apparatus 350 includes a multiplier circuit array 360, a modular 
reduction circuit array 370, and register files 380, 382, 384, and 386. 

Multiplier circuit array 360 and modular reduction circuit array 370 may be 
implemented in the same manner as multiplier circuit array 320 and modular reduction 
circuit array 330 discussed above, and therefore will not be discussed further here. 

Register file 380 is depicted in FIG. 3B as having 64 17-bit R1-registers, register
file 382 as having 64 17-bit R2-registers, register file 384 as having 64 17-bit T1-registers,
and register file 386 as having 64 17-bit T2-registers. Thus, one R1-register, one R2-register,
one T1-register, and one T2-register are provided for each of the 64 residues in the RNS
base. As described later herein, the data flow for apparatus 310 is arranged such that only the
set of R1-registers and the set of R2-registers are used in the first pass of the dual passes as
represented by block 130 in FIG. 1, while all four register files are used for the second pass
of the dual passes as represented by block 140. 

III. ACRONYMS
Acronyms used herein have the following meanings:

IKEON    Security processor for IKE acceleration with accompanied crypto transform functions
IKE      Refers to the Internet Key Exchange according to RFC 2409
NP       Refers to the Toaster Network Processor
TMC      Core VLIW engine in NP ASIC
RMC      Reduced NP (TMC) Micro-Controller; core VLIW engine in NP ASIC
ASIC     Application Specific Integrated Circuit
DES      Data Encryption Standard; symmetric encryption based on a 56-bit key
3DES     Triple DES, such as using 3x56-bit keys
MD5      Message Digest 5; hash function
SHA-1    Secure Hash Algorithm 1; hash function
AES      Advanced Encryption Standard
SA       Security Association
SAD      Security Association Data Base
SPD      Security Policy Data Base
IPSec    Internet Protocol Security
AH       Authentication Header
ESP      Encapsulating Security Payload
RNS      Residue Number System
RNR      Redundant Number Representation
Ra, Rb   Source Data Registers
Rd       Destination Data Register
Rm       Modulus register
CRT      Chinese Remainder Theorem
GCD      Greatest Common Divisor
RND      Random Number Generator
RSA      Rivest/Shamir/Adleman Encryption Algorithm
DSA      Digital Signature Algorithm

IV. REVIEW OF IPSEC AND IKE PROCESSING 



The Internet Protocol Security (IPSec) packet processing complex consists of a
set of components that include: a MIPS (RP) processor, an NP ultra-fast forwarding engine, a
feeder chip, and the IKEON processor. In addition to the standard packet header mode of
operation, the full packet mode engages the processing complex in IPSec packet
processing. The following are some of the scenarios pertaining to the IPSec packet mode
of operation: an IP packet that is IPSec encrypted is detected and is directed to the
processor complex for decryption; a line card receives a plain IP packet (i.e., not
encrypted) and the forwarding decision determines that the packet should be forwarded to
an outgoing interface that requires the packet to be IPSec encrypted; the RP generates a
management packet (during the SA negotiation process) that needs to be IPSec
encrypted; etc.

The outbound IPSec traffic requires an SPD lookup to be performed for policy
determination. Then the packet is encrypted accordingly (after the SA is established) and
delivered to the forwarding engine.

The following steps are made during the outbound IPSec packet processing:

• Retrieve IP source/destination address 

• Retrieve IP source/destination port 

• Retrieve L4 protocol id 

The above values form the selectors used as search keys for the SPD lookup. 
The policy found will determine whether IPSec encryption is permitted or denied to the 
packet. If denied, a special SA tag will be added to the packet header so that the packet 
will be passed without encryption taking place. If the policy mandates encryption, 
further verification is performed to determine if an SA has been negotiated. If an SA already
exists, the SA handle will be returned and added to the packet header. If not, a message
will be sent to the RP to start the IKEON program for IKE-SA negotiation.
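
The outbound decision described above can be sketched in Python. All names here
(`Selector`, `outbound_decision`, `BYPASS_TAG`, the policy strings, and the dictionaries
standing in for the SPD and the table of negotiated SAs) are hypothetical, and treating a
missing policy as "deny" is an assumption that the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selector:
    """Search key for the SPD lookup, built from the retrieved packet fields."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    l4_protocol: int


BYPASS_TAG = "BYPASS"                     # stand-in for the special SA tag


def outbound_decision(selector, spd, negotiated_sas):
    """Return (action, tag_or_handle) for an outbound packet."""
    policy = spd.get(selector, "deny")    # assumption: no policy means no encryption
    if policy == "deny":
        return ("pass_unencrypted", BYPASS_TAG)
    sa_handle = negotiated_sas.get(selector)
    if sa_handle is not None:
        return ("encrypt", sa_handle)     # SA handle is added to the packet header
    return ("negotiate", None)            # message to the RP to start IKE-SA negotiation


sel = Selector("10.0.0.1", "10.0.0.2", 1234, 80, 6)
spd = {sel: "encrypt"}
before = outbound_decision(sel, spd, {})
after = outbound_decision(sel, spd, {sel: 42})
```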

The other case of a new SA negotiation is related to the SA lifetime. A lifetime
timer associated with each negotiated SA may expire, in which case a new SA will be
negotiated to replace the existing one. The RP and NP complex will be responsible for
managing the SA lifetime timer and should send a message to the IKEON to initiate SA
re-negotiation before timer expiration.

The inbound IPSec processing extracts a set of parameters from the packet
headers, such as the IPSec protocol ID and the Security Parameter Index (SPI). These two
parameters form the search key for the SA data base. If the search fails, the packet is
discarded and the relevant statistics are updated. If the search succeeds, an SA handle is
returned, which is appended to the packet to be queued and transmitted to the NP complex.
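
The inbound lookup can be sketched in the same way. The function name, the dictionary
standing in for the SA data base, and the statistics counter are hypothetical; the protocol
numbers 50 and 51 are the IP protocol numbers assigned to ESP and AH.

```python
ESP, AH = 50, 51                          # IP protocol numbers for ESP and AH


def inbound_lookup(protocol_id, spi, sad, stats):
    """Return the SA handle for (protocol ID, SPI), or None if the packet is discarded."""
    handle = sad.get((protocol_id, spi))  # the two parameters form the search key
    if handle is None:
        stats["discarded"] = stats.get("discarded", 0) + 1
        return None
    return handle                         # appended to the packet before queuing


sad = {(ESP, 0x1001): "sa-7"}
stats = {}
hit = inbound_lookup(ESP, 0x1001, sad, stats)
miss = inbound_lookup(AH, 0x1001, sad, stats)
```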

The RP and NP will be microcoded to implement the inbound and outbound 
IPSec data path functionality. The NP contains crypto engines that can process 
DES/3DES encryption/decryption combined with HMAC SHA-1/MD5 at OC-48 and 
higher rates. 

Before an IPSec packet is sent to the NP, a control header will be added that carries 
all the control information (such as encryption, decryption, or authentication keys, 
offsets, modes, etc.) required by the NP to encrypt, decrypt, and/or authenticate the 
packet. 

A. IPSec Packet Processing Procedures 
The following is a short overview of the inbound/outbound IPSec packet 
processing procedures. 

For the inbound IPSec data path: 

•IPSec protocol processor (first cluster TMC) does SAD lookup; 
•Pass the packet to next cluster TMC for decryption; 
•IKE message is always directed to the RP; 
•Decrypted packet is passed to the RP for other processing; and 
•Processed packet is queued in the feeder chip to be sent to the outgoing 
port. 

For the outbound IPSec data path: 

•IPSec processor (first cluster TMC) performs SPD lookup; 

•If SA available, pass the packet to the next cluster TMC for encryption; 

•Packet with no SA available is directed to the feeder (RP); 

•RP starts SA negotiation through the IKEON; and 

•IKEON negotiates keys. 

•Processed packet is queued in the feeder to be sent out. 

FIG. 4 is a flow diagram that depicts the inbound and outbound IPSec packet 
processing procedures. 

B. Introduction to IKE Processing 

Internet Key Exchange (IKE), as defined in RFC-2409, is the protocol used to 
establish security associations that are needed to generate and refresh keys used in IPSec 
packet processing. IKE was originally called ISAKMP/Oakley - Internet Security 
Association and Key Management Protocol (ISAKMP), defined in RFC 2408. It 
provided a framework to establish security associations and cryptographic keys. IKE still 
uses ISAKMP as its framework but incorporates Oakley and SKEME as its key 
exchange protocol. IKE is made up of two phases as defined in the ISAKMP framework, 
and within these phases Oakley defines a number of modes that can be used. 

Phase 1 is the process in which the ISAKMP security association is 
established. It assumes that no secure channel currently exists, and one is subsequently 
established to protect the ISAKMP messages. Two modes are available for use in 
Phase 1: main mode and aggressive mode. Support for main mode is a mandatory 
requirement for IKE, while aggressive mode is optional. Main mode has the advantage 
of being able to protect the identities of the parties trying to establish the SA, while 
aggressive mode has the advantage of being able to use three rather than six message 
flows to establish the ISAKMP SA. 

Phase 2 is where subsequent security associations required by the IPSec 
services are negotiated. 

The IPSec protocols AH and ESP require that shared secrets be known to all 
participating parties, which requires either manual key entry or out-of-band key distribution. 
The Internet Key Exchange (IKE) protocol has been designed to meet the stringent 
requirements of the IPSec protocol. It is based on the Internet Security Association and 
Key Management Protocol (ISAKMP) framework and the Oakley key distribution 
protocol. 

IKE offers the following features: 

•Key generation and identity authentication procedures 
•Automatic key refresh 
•Perfect forward secrecy (PFS) 
•Two-phased approach 

•Phase 1 - Establish keys and SA for key exchanges 
•Phase 2 - Establish SAs for data transfer 
•Implemented as application over UDP 

•Supports host-oriented (IP address) and user-oriented (long-term identity) 
certificates 

•Uses strong authentication for ISAKMP exchanges 
•No actual keys are shared, only a token used to create keying material 
•Digital signatures (using either DSS or RSA methods) 
•Public key encryption (RSA and revised RSA) 

The following sections present an overview of IKE Phases 1 and 2. FIG. 5 is a 
diagram that depicts an overview of IKE Phases 1 and 2. 

C. IKE Phase 1 Overview 

During Phase 1, the partners exchange proposals for the ISAKMP SA and 
agree on one. This contains specifications of authentication methods, hash functions and 
encryption algorithms to be used to protect the key exchanges. The partners then 
exchange information for generating a shared master secret: 

•Cookies that also serve as SPIs for the ISAKMP SA 

•Diffie-Hellman values 

•Nonces (random numbers) 

•Optionally exchange IDs when public key authentication is used 

Both parties then generate keying material and shared secrets before 
exchanging additional authentication information. 

D. IKE Phase 2 Overview 

During Phase 2, the partners exchange proposals for protocol SAs and agree on 
one. This contains specifications of authentication methods, hash functions and 
encryption algorithms to be used to protect packets using AH and/or ESP. To generate 
keys, both parties use the keying material from a previous Phase 1 exchange and they 
can optionally perform an additional Diffie-Hellman exchange for PFS. 

The Phase 2 exchange is protected by the keys that have been generated during 
Phase 1, which effectively ties a Phase 2 to a particular Phase 1. 

V. RSA PROTOCOL 
The message space and ciphertext space for the RSA public key encryption are 
both Z/mZ, where m is a 2048 bit modulus. The modulus m = pq is the product of two 
randomly chosen distinct prime numbers p and q, both of 1024 bits; m is public knowledge. 

A. RSA Key Generation 
Keys for the RSA protocol are generated as follows: 

1. Generate two distinct random prime numbers, p and q, of similar size (both 
with 1024 bits). 

2. Compute m = pq and u = (p - 1)(q - 1). 

3. Select a random public exponent e (recommended value e = 3 or 2^16 + 1), 
1 < e < u, such that gcd(e, u) = 1. 

4. Compute integer d from de = 1 (mod u). 

5. Broadcast public key (m, e). 

6. Keep private key d and primes p,q secret. 
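
As a rough illustration of steps 2 through 4, the sketch below derives a key pair from two given primes. The function name `rsa_keygen` is hypothetical, prime generation (step 1) is left out, and the tiny primes in the usage note are for readability only and offer no security.

```python
from math import gcd

def rsa_keygen(p, q, e=65537):
    """Return ((m, e), d) for given distinct primes p and q (toy sketch)."""
    m = p * q                      # step 2: modulus
    u = (p - 1) * (q - 1)          # step 2: u = (p - 1)(q - 1)
    if not (1 < e < u) or gcd(e, u) != 1:
        raise ValueError("e must satisfy 1 < e < u and gcd(e, u) = 1")
    d = pow(e, -1, u)              # step 4: d*e = 1 (mod u), Python 3.8+
    return (m, e), d
```

For example, `rsa_keygen(61, 53, 17)` gives the public key `(3233, 17)` and the private exponent `2753`, since 17 * 2753 = 46801 = 15 * 3120 + 1.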

B. RSA Signature Generation and Verification for Authentication 
From the original "Text" the "source" generates and signs a message M, and 
then the "destination" verifies signature and recovers the message M. 
RSA signature generation is performed as follows: 

(1) Compute M = transf(Text), for a given transform function "transf()". 






(2) Compute S = M^d mod m, using the private exponent d. 

(3) S is the signature from the source. 

RSA signature verification is performed as follows: 

(1) Get public key (m, e). 

(2) Compute and verify M = S^e mod m. 

(3) Recover Text = transf^{-1}(M). 
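
A minimal sketch of signing and verification follows, assuming the usual convention that the source signs with the private exponent d and the destination verifies with the public exponent e. The helper names are hypothetical, the transform "transf()" is omitted, and M is taken to be an integer smaller than m.

```python
def rsa_sign(M, d, m):
    """Signature S = M^d mod m, computed with the private exponent d."""
    return pow(M, d, m)

def rsa_recover(S, e, m):
    """Recover M = S^e mod m using only the public key (m, e)."""
    return pow(S, e, m)

def rsa_verify(M, S, e, m):
    """Accept the signature if the recovered value equals the expected M."""
    return rsa_recover(S, e, m) == M % m
```

With the toy key m = 3233, e = 17, d = 2753, signing a value and recovering it with the public key returns the original value.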

C. RSA Encryption/Decryption Introduction 
From the original "Text" the source generates and encrypts a message M. 
Encryption is performed as follows: 

(1) Compute M = transf(Text), for a given transform function "transf()". 

(2) Compute C = M^e mod m. 

(3) C is the encrypted message. 
Decryption is performed as follows: 

(1) Compute M = C^d mod m. 

(2) Recover Text = transf^{-1}(M). 

If m = pq is a 2048 bit RSA modulus where p and q are 1024 bit primes, 
computing M = C^d mod m (or S = M^d mod m) can be made more 
efficient by using the CRT decomposition into residues based on p and q. Since the p,q 
pair is known, compute S_1 = M^d mod p and S_2 = M^d mod q and then determine S by using 
the CRT. This process, as presented below, allows two 1024 bit modular ALUs to 
operate in parallel to produce a 2048 bit result. 

D. RSA Encryption/Decryption Using Single-Radix Conversion 

Given the following: p, q - 1024 bit primes; C - 2048 bit cipher text; M - 2048 
bit message to be deciphered; and d - 2048 bit exponent; the single-radix conversion 
(SRC) for RSA encryption/decryption is performed as follows: 

(1) Compute d_p = d (mod p-1); d_q = d (mod q-1) 

(2) Compute C_p = C (mod p); C_q = C (mod q) 

(3) Compute M_p = C_p^{d_p} (mod p); M_q = C_q^{d_q} (mod q) 

(4) Compute M = (M_p q (q^{-1} mod p) + M_q p (p^{-1} mod q)) (mod m) 

(5) Recover Text = transf^{-1}(M). 

Note: this SRC approach relies on 2048 bit arithmetic, whereas the MRC 
approach below only relies on 1024 bit arithmetic. 
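
The SRC steps can be sketched directly with Python integers. The function name `rsa_crt_src` is hypothetical, and the two half-size exponentiations run sequentially here, whereas the text assigns them to two parallel ALUs.

```python
def rsa_crt_src(C, d, p, q):
    """Compute C^d mod (p*q) using single-radix (CRT) recombination."""
    m = p * q
    dp, dq = d % (p - 1), d % (q - 1)        # step 1
    Cp, Cq = C % p, C % q                    # step 2
    Mp = pow(Cp, dp, p)                      # step 3, first half-size problem
    Mq = pow(Cq, dq, q)                      # step 3, second half-size problem
    # step 4: the recombination works with full-width values modulo m = p*q
    return (Mp * q * pow(q, -1, p) + Mq * p * pow(p, -1, q)) % m
```

Step 4 is where the full-width arithmetic mentioned in the note appears: both products and the final reduction are taken modulo m.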

E. RSA Encryption/Decryption Using Mixed-Radix Conversion 

Given the following: p, q - 1024 bit primes; C - 2048 bit cipher text; M - 2048 
bit message to be deciphered; and d - 2048 bit exponent; the mixed-radix conversion 
(MRC) for RSA encryption/decryption is performed as follows: 

(1) Compute d_p = d (mod p-1); d_q = d (mod q-1) 

(2) Compute C_p = C (mod p); C_q = C (mod q) 

(3) Compute M_p = C_p^{d_p} (mod p); M_q = C_q^{d_q} (mod q) 

(4) Compute M_{q-p} = M_q - M_p 

(5) Compute M = M_p + [ (M_{q-p} (p^{-1} mod q)) (mod q) ] p 

(6) Recover Text = transf^{-1}(M). 

The MRC approach described above is used to decompose a 2048 bit 
exponentiation problem into two 1024 bit exponentiation problems. The exponentiations 
M_p = C_p^{d_p} (mod p) and M_q = C_q^{d_q} (mod q), as well as the modular reductions, 
are computed in parallel on two separate modular arithmetic ALUs with 1024 bit 
operands. The final recombination and recovery of the 2048 bit result will be done based on 
the MRC algorithm as described above. 
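
A compact sketch of the MRC steps is given below, under the same assumptions as before (hypothetical function name, sequential rather than parallel exponentiation). The only modular operations are modulo p or modulo q; the final step is a plain multiply and add.

```python
def rsa_crt_mrc(C, d, p, q):
    """Compute C^d mod (p*q) using mixed-radix recombination."""
    dp, dq = d % (p - 1), d % (q - 1)    # step 1
    Mp = pow(C % p, dp, p)               # steps 2-3, residue modulo p
    Mq = pow(C % q, dq, q)               # steps 2-3, residue modulo q
    diff = Mq - Mp                       # step 4 (may be negative)
    h = (diff * pow(p, -1, q)) % q       # step 5, half-size reduction modulo q
    return Mp + h * p                    # step 5, result already lies in [0, p*q)
```

Because h < q and M_p < p, the value M_p + h*p is always below pq, so no final reduction modulo m is needed.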

VI. DIGITAL SIGNATURE ALGORITHM (DSA) PROTOCOL 
The digital signature algorithm (DSA) is a U.S. Federal Information 
Processing Standard (FIPS 186) called the Digital Signature Standard (DSS). The 
signature mechanism here requires a hash function based on the SHA-1 algorithm. 

A. DSA Key Generation 
From the original "Text" the "source" generates and signs a message M, and 
the "destination" verifies signature and recovers the message M, according to the 
following: 

(1) Select a prime number q such that 2^159 < q < 2^160. 

(2) Select a prime number p such that 2^{511+64a} < p < 2^{512+64a}, with a in [0,8], 
and such that q divides (p - 1). 

(3) Select an integer g and compute h = g^{(p-1)/q} mod p such that h != 1. 

(4) Select a random integer b such that 1 <= b <= q - 1. 

(5) Compute y = h^b mod p. 

(6) The public key is represented by (p, q, h, y) while b is the private key. 

B. DSA Signature Generation and Verification 
DSA signature generation is performed as follows: 

(1) Select a random secret integer k such that 0 < k < q. 

(2) Compute t = (h^k mod p) mod q. 

(3) Compute k^{-1} mod q. 

(4) Compute s = k^{-1} [SHA1(M) + bt] mod q. 

(5) The signature for M is (s, t). 

DSA signature verification is performed as follows: 

(1) Get (p, q, h, y). 

(2) Verify 0 < s,t < q; if not, then reject the signature. 

(3) Compute w = s^{-1} mod q and SHA1(M). 

(4) Compute u_1 = w SHA1(M) mod q and u_2 = tw mod q. 

(5) Compute v = (h^{u_1} y^{u_2} mod p) mod q. 

(6) If v = t, the signature is verified. 
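
The generation and verification steps can be exercised with a toy parameter set. In the sketch below the function names are hypothetical, the caller supplies k so that the example stays deterministic, and the parameters in the usage note (p = 23, q = 11, h = 4) are far too small for real use; they only satisfy the structural conditions that q divides p - 1 and h != 1.

```python
import hashlib

def sha1_int(msg: bytes) -> int:
    """SHA-1 digest of msg interpreted as a big-endian integer."""
    return int.from_bytes(hashlib.sha1(msg).digest(), "big")

def dsa_sign(msg, p, q, h, b, k):
    """Return (s, t) for private key b and per-message secret k, 0 < k < q."""
    t = pow(h, k, p) % q
    s = (pow(k, -1, q) * (sha1_int(msg) + b * t)) % q
    return s, t

def dsa_verify(msg, sig, p, q, h, y):
    """Check a signature (s, t) against the public key (p, q, h, y)."""
    s, t = sig
    if not (0 < s < q and 0 < t < q):
        return False                      # step 2: reject out-of-range values
    w = pow(s, -1, q)
    u1 = (w * sha1_int(msg)) % q
    u2 = (t * w) % q
    v = (pow(h, u1, p) * pow(y, u2, p)) % p % q
    return v == t
```

With p = 23, q = 11, h = 4, private key b = 7 and y = 4^7 mod 23, any k that yields nonzero s and t produces a signature that verifies. A real signer would draw k at random and retry when s or t is zero.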

The size of q is fixed at 160 bits, while the size of p can be any multiple of 64 
between 512 and 1024 bits inclusive. FIPS 186 does not permit primes p larger than 
1024 bits. Signature generation requires one modular exponentiation (1024 bit key), one 
modular inverse with a 160-bit modulus, two 160-bit modular multiplications, and one 
addition. The 160-bit operations are relatively minor compared to the exponentiation. 
DSA has the advantage that the exponentiation can be precomputed and need not be 
done at the time of signature generation. By comparison, no precomputation is possible 
with the RSA signature scheme. The major portion of the work for signature verification 
is two exponentiations modulo p, each with a 160-bit exponent. 

VII. DIFFIE-HELLMAN PROTOCOL 
In the basic Diffie-Hellman key agreement protocol, two parties 'A' and 'B' 
exchange initial messages over an open channel. Both 'A' and 'B' compute an identical 
secret K. An appropriate prime p and generator h, 2 <= h <= p-2, are agreed upon and 
published. 

The following procedure illustrates the process: 

(1) 'A' chooses a random number x, such that 1 <= x <= p-2, and sends out 
a = h^x mod p while x remains private. 

(2) 'B' chooses a random number y, such that 1 <= y <= p-2, and sends out 
b = h^y mod p while y remains private. 

(3) 'B' computes the shared key K: K = a^y mod p. 

(4) 'A' computes the shared key K: K = b^x mod p. 
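
The exchange can be sketched as follows. The helper names `dh_keypair` and `dh_shared` are hypothetical, and the small prime in the usage note is for illustration only.

```python
import secrets

def dh_keypair(p, h):
    """Pick a private value in [1, p-2] and return (private, public)."""
    private = secrets.randbelow(p - 2) + 1   # randbelow gives 0..p-3
    return private, pow(h, private, p)

def dh_shared(peer_public, private, p):
    """Shared key K = peer_public^private mod p."""
    return pow(peer_public, private, p)
```

For instance, with p = 23 and h = 5, calling `dh_keypair` once for 'A' and once for 'B' and then `dh_shared` on each side with the other party's public value gives both parties the same K, because (h^x)^y = (h^y)^x mod p.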

VIII. IKEON ASIC ARCHITECTURE 
According to one embodiment, an IKEON ASIC provides a VLIW engine for 
standard IKE protocol processing (32/64 bit non-modular operations) and master control 
operations in addition to a set of IPSec transform functions (DES, 3DES, MD5, SHA-1) 
and modular arithmetic on large data operands (up to 2048 bits) for encryption protocols 
such as the RSA, DSA, and Diffie-Hellman protocols, as implemented in hardware. The 
transform functions and modular arithmetic processors are implemented as coprocessors 
appended to the main ALU within the master execution unit of the VLIW RMC Core 
engine. Both 32- and 64-bit operations are supported as well as byte-enabled writes. 

A. IKEON High Level Block Diagram 

FIG. 6 is a block diagram of an IKEON processor 600, according to an 
embodiment. FIG. 6 depicts IKEON processor 600 communicatively coupled to an 
external routing processor (RP) 610. IKEON processor 600 includes an RP interface 
620, a local data memory 630, an instruction memory 640, an exception handler 650, a 
processor core 660, and a set of coprocessors that may be used for encryption protocol 
computations, such as for the RSA, DSA, and Diffie-Hellman protocols. 

RP interface 620 provides communications with the external RP 610. Local 
data memory 630 is shared between the RP and RMC processors, and local data memory 
630 may include 24 KB of local data SRAM. Instruction memory 640 is accessible by 
the external RP processor and may include 16 KW of RMC instruction memory. 
Processor core 660 may be implemented as a VLIW engine with double, non-modular, 
32/64 bit execution units. 

The set of coprocessors includes the following: a modular ALU-1 670, a 
modular ALU-2 672, a CRT module 676, a GCD module 678, a DES/AES module 680, 
an MD5/SHA-1 module 682, a MEXT module 684, and an RNG module 686. 

Modular ALU-1 670 and modular ALU-2 672 are arithmetic logic units 
(ALUs) that perform modular arithmetic on operands. For example, modular ALU-1 
670 and modular ALU-2 672 may be 1024 bit wide ALUs that perform modular 
arithmetic on 1024 bit operands. As explained above, for an encryption protocol that 
uses a 2048 bit key, the 2048 bit modular operations may be decomposed into two 1024 
bit modular operations and performed in parallel, such as on modular ALU-1 670 and 
modular ALU-2 672. 

CRT module 676 is a Chinese Remainder Theorem module that reduces a 
larger operand into two smaller operands and later recombines the smaller operands into a 
larger operand. For example, for an encryption protocol that uses a 2048 bit key, CRT 
module 676 reduces the 2048 bit operands into two 1024 bit operands, and later 
recombines the 1024 bit operands into a 2048 bit operand. 

GCD module 678 performs multiplicative inverse computations (e.g., based on a 
greatest common divisor, or GCD, computation). For example, GCD module 678 may be 
used to find an integer Z such that AZ = 1 (mod M) for a modulus M. Z is called the 
multiplicative inverse of A modulo M. 
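
One common way to obtain such an inverse is the extended Euclidean algorithm, which is the usual link between GCD computation and modular inversion. The sketch below is a generic software illustration with a hypothetical function name; the text does not state which algorithm GCD module 678 actually uses.

```python
def mod_inverse(a, m):
    """Return Z with (a * Z) % m == 1, or raise if gcd(a, m) != 1."""
    r0, r1 = m, a % m          # remainders of the Euclidean algorithm
    t0, t1 = 0, 1              # invariant: t_i * a = r_i (mod m)
    while r1:
        quotient = r0 // r1
        r0, r1 = r1, r0 - quotient * r1
        t0, t1 = t1, t0 - quotient * t1
    if r0 != 1:
        raise ValueError("a has no inverse modulo m")
    return t0 % m
```

For example, `mod_inverse(3, 11)` returns 4, since 3 * 4 = 12 = 1 (mod 11).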

DES/AES module 680 performs computations according to the Data 
Encryption Standard (DES), including triple DES based on a 3x56 bit key, and according 
to the Advanced Encryption Standard (AES). 

MD5/SHA-1 module 682 performs computations according to the Message 
Digest 5 (MD5) algorithm and the Secure Hash Algorithm-1 (SHA-1). 

MEXT module 684 performs non-modular 32/64 bit multiplication/division. 

RNG module 686 performs random number generation (RNG). 

B. IKEON Modular Arithmetic Processor 
The IKEON modular arithmetic processor implements modular arithmetic, 
including modular exponentiation, on up to 2048 bit operands. For maximum efficiency, 
two ALUs, 1024 bits wide, are provided that operate in parallel on either the original 
1024 bit cipher text, or on one of the two 1024 bit operands obtained by CRT residue 
decomposition of 2048 bit cipher text; this is based on the 2048 bit key decomposition 
into two 1024 bit prime factors - modulus m = pq, where p and q are 1024 bits wide. This 
mechanism will be explained in detail in a later section. 

Modular ALUs perform the following operations: 



(1) Modular addition 

(2) Modular subtraction 

(3) Modular reduction 

(4) Modular multiplication 

(5) Modular exponentiation 
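
As a behavioral reference only, the five operations can be modeled with Python's arbitrary-precision integers. The function names and the argument names ra, rb, rm are illustrative; this sketch says nothing about how the hardware ALUs are built.

```python
def mod_add(ra, rb, rm):
    """(ra + rb) mod rm"""
    return (ra + rb) % rm

def mod_sub(ra, rb, rm):
    """(ra - rb) mod rm; Python's % keeps the result in [0, rm)."""
    return (ra - rb) % rm

def mod_red(ra, rm):
    """ra mod rm"""
    return ra % rm

def mod_mult(ra, rb, rm):
    """(ra * rb) mod rm"""
    return (ra * rb) % rm

def mod_exp(ra, rb, rm):
    """(ra ** rb) mod rm, using the built-in three-argument pow."""
    return pow(ra, rb, rm)
```

For example, `mod_red(27, 10)` returns 7, matching the numerical example in the background section.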



FIG. 7 is a diagram that depicts how a modular ALU is attached to the 
processor's execution unit of the Reduced-Toaster Micro Controller (RMC) core engine. 



C. Modular Arithmetic Processor Instructions 

The following instructions will be included in the RMC instruction set to 
support modular and large operand arithmetic operations (the sample instruction set 
below operates on up to 2048 bit operands): 

(1) ModAdd: Rd = (Ra + Rb) mod Rm 

(2) ModSub: Rd = (Ra - Rb) mod Rm 

(3) ModRed: Rd = Ra mod Rm 

(4) ModMult: Rd = (Ra x Rb) mod Rm 

(5) ModExp: Rd = Ra^Rb mod Rm 

IX. MODULAR ALU ARCHITECTURE 
The following presents the algorithmic aspect of the modular ALU 
architecture, according to one embodiment. The basic arithmetic operations of addition, 
subtraction, and multiplication are performed in Z/mZ, the integers modulo m (m is a 
large positive integer, not necessarily a prime). The encryption protocols, such as RSA, 
DSA and Diffie-Hellman, require algorithms for performing multiplication and 
exponentiation in Z/mZ. Techniques described here are directly mapped into hardware 
structures of the IKE-ALU unit. 

A. Preliminary Definitions 
Multiple-precision modular operations are performed in Z/mZ, the integers 
modulo m, where m is a multiple-precision positive integer. For example, 
m = (m_0, ..., m_{n-1}), 
x = (x_0, ..., x_{n-1}), and 

y = (y_0, ..., y_{n-1}) 

are non-negative integers in radix 2 representation such that x,y < m. Methods for 
computing x+y (mod m), x-y (mod m), x*y (mod m), x^{-1} (mod m) are described below. 

For the bit indexing approach used in the representations above, the bits 
(e.g., m_i, x_i, and y_i) are numbered from 0 to n-1 from left to right. For example, the 
binary value 1011 (e.g., decimal 11) has "i" values of "0123." However, other 
implementations may use other bit indexing approaches. For example, the bits may be 
numbered from 1 to n from right to left, such that the binary value 1011 has "i" values of 
4321. 

If z is any integer, then z (mod m) is called modular reduction of z with respect 
to modulus m. 

B. Dual-Pass Montgomery Multiplication with Precomputation 
According to one embodiment, dual-pass Montgomery multiplication with 
precomputation to provide the exact (i.e., unscaled) output is based on the input: 
x, y, m, w 

where 

x,y < m, 
w >= 4m, 
m'm = -1 (mod w); 
and precomputing the constant, R, as follows: 

R = w^2 (mod m) 
to provide the following output: 

r = xy (mod m) 
based on the following steps: 

(1) z = xR 

(2) u = zm' (mod w) 

(3) r = (z + um)/w 

(4) z = yr 

(5) u = zm' (mod w) 

(6) r = (z + um)/w. 

If Mont(x,y,r) denotes one-pass Montgomery multiplication, then the steps in 
the multiplication approach above can be written as: 

(1) Mont(x,R,r) 

(2) Assign: x <== r 

(3) Mont(x,y,r). 

where x, y are the inputs to each of the dual passes, respectively, and r, r are the 
outputs of each of the dual passes, respectively. 
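
The six steps can be checked numerically with the sketch below. Several choices here are assumptions rather than part of the description: w is taken as a power of two (so m must be odd), m' is derived with Python's modular inverse, the helper names are hypothetical, and a final `% m` is added as a safety reduction because a single pass only guarantees a result congruent to the product and bounded by a small multiple of m.

```python
def mont(a, b, m, w, m_prime):
    """One Montgomery pass: returns r with r = a*b*w^{-1} (mod m)."""
    z = a * b
    u = (z * m_prime) % w
    return (z + u * m) // w        # exact division: z + u*m = 0 (mod w)

def mont_setup(m):
    """Pick w >= 4m as a power of two and derive m' and R (requires odd m)."""
    w = 1 << (m.bit_length() + 2)          # 2^(bits+2) > 4m
    m_prime = (-pow(m, -1, w)) % w         # m' * m = -1 (mod w)
    R = (w * w) % m                        # precomputed constant R = w^2 (mod m)
    return w, m_prime, R

def dual_pass_mult(x, y, m):
    """x*y mod m via two Montgomery passes with the precomputed R."""
    w, m_prime, R = mont_setup(m)
    r = mont(x, R, m, w, m_prime)          # pass 1: r = x*w (mod m)
    r = mont(r, y, m, w, m_prime)          # pass 2: r = x*y (mod m)
    return r % m                           # safety reduction (assumption)
```

The first pass moves x into the Montgomery domain (scaled by w), and the second pass multiplies by y while removing that scale factor, which is why the output is unscaled.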

C. General Montgomery Exponentiation 
According to one embodiment, general Montgomery exponentiation is 
performed in a loop based on the inputs: 
x, y, m, w 

where 

x,y < m, 
w >= 4m, 
m'm = -1 (mod w); 
to provide the following output: 

t = x^y (mod m) 
based on the following steps: 

(1) t <= 1, r <= x 

(2) for (i = 1 to n), 
        if (y_i = 1) then {t <= tr (mod m)} 
        r <= r^2 (mod m) 

If Mont(x,y,r) denotes one-pass Montgomery multiplication, then the steps in 
the general exponentiation approach above can be written in binary-H form, using the 
bit indexing approach in which the binary value 1011 has "i" values of 0123, as: 

(1) t <= 1, r <= x 

(2) for (i = 1 to n), 
        Mont(t,t,t) 
        if (y_i = 1) then {Mont(r,t,t)} 

Other approaches may be used as well. For example, using binary-L and a bit 
indexing approach in which the binary value 1011 has "i" values of 4321, the above 
steps may be expressed as: 

(1) t <= 1, r <= x 

(2) for (i = 1 to n), 
        if (y_i = 1) then {Mont(r,t,t)} 
        Mont(r,r,r) 
D. Dual-Pass Montgomery Exponentiation with Precomputation 
Dual-Pass Montgomery exponentiation is performed in a loop based on the 
inputs: 

x, y, m, w 

where 

x,y < m, 
w >= 4m, 
m'm = -1 (mod w); 
and precomputing the constant, R, as follows: 

R = w^2 (mod m) 
to provide the following output: 

t = x^y (mod m) 
based on the following steps: 

(1.0) t = 1 
(1.1) z = xR 
(1.2) u = zm' (mod w) 
(1.3) r = (z + um)/w 
(2.0) for (i = 1 to n) 
(2.1)     if (y_i = 1) then 
(2.1.1)       z = rt 
(2.1.2)       u = zm' (mod w) 
(2.1.3)       t = (z + um)/w 
          endif 
(2.2)     z = rr 
(2.3)     u = zm' (mod w) 
(2.4)     r = (z + um)/w 
(2.5) Return. 
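
The loop can be sketched as below, scanning the exponent from its least significant bit. As assumptions for this illustration, w is a power of two (so m must be odd), the function names are hypothetical, and a final `% m` is added as a safety reduction. Note that r stays in the Montgomery domain (scaled by w) while t stays unscaled, so no conversion is needed at the end.

```python
def mont(a, b, m, w, m_prime):
    """One Montgomery pass: returns r with r = a*b*w^{-1} (mod m)."""
    z = a * b
    u = (z * m_prime) % w
    return (z + u * m) // w

def mont_exp(x, y, m):
    """x^y mod m following steps (1.0)-(2.5); requires odd m."""
    w = 1 << (m.bit_length() + 2)          # power of two with w >= 4m
    m_prime = (-pow(m, -1, w)) % w         # m' * m = -1 (mod w)
    R = (w * w) % m                        # precomputed R = w^2 (mod m)
    t = 1                                  # step 1.0
    r = mont(x, R, m, w, m_prime)          # steps 1.1-1.3: r = x*w (mod m)
    while y:                               # step 2.0, least significant bit first
        if y & 1:
            t = mont(r, t, m, w, m_prime)  # steps 2.1.1-2.1.3: scale cancels
        r = mont(r, r, m, w, m_prime)      # steps 2.2-2.4: r = r^2 / w
        y >>= 1
    return t % m                           # safety reduction (assumption)
```

Each Mont(r, t) multiplies a scaled value by an unscaled one and divides by w once, so t always holds an unscaled partial product of powers of x.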
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X. COMPUTATIONS IN THE RESIDUE NUMBER SYSTEM (RNS) 

A. Introduction to RNS 

Given: 

moduli m_1, ..., m_n 
such that: 

m = m_1 * m_2 * ... * m_n 
and 

gcd(m_i, m_j) = 1 for i != j, 
an integer x that satisfies: 
x < m 

has a modular representation or mixed-radix representation of: 
x = (x_1, ..., x_n) 

where: 

x_i = x (mod m_i). 

The set of modular representations for all integers x in the range x < m is called 
a residue number system (RNS). 
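
A small sketch of the representation and its inverse follows. The function names are hypothetical, and the reconstruction uses the standard CRT formula rather than any particular hardware method.

```python
from math import prod

def to_rns(x, moduli):
    """Residues (x mod m_1, ..., x mod m_n) for pairwise coprime moduli."""
    return tuple(x % mi for mi in moduli)

def from_rns(residues, moduli):
    """Recover x < m_1*...*m_n from its residues with the CRT."""
    M = prod(moduli)
    total = 0
    for xi, mi in zip(residues, moduli):
        Mi = M // mi
        total += xi * Mi * pow(Mi, -1, mi)   # Mi^{-1} taken modulo m_i
    return total % M

def rns_mul(a, b, moduli):
    """Componentwise product; represents (A*B) mod (m_1*...*m_n)."""
    return tuple((ai * bi) % mi for ai, bi, mi in zip(a, b, moduli))
```

With moduli (3, 5, 7), the integer 23 is represented as (2, 3, 2). Multiplication acts independently on each residue, which is what allows many small multipliers to work in parallel.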

B. Modular Multiplication in RNS 

According to one embodiment, modular multiplication in RNS is performed in 
a loop based on the inputs: 

x, y, m, W, V; 
and based on the RNS representations: 

W = w_1 * w_2 * ... * w_{n-1} * w_n, 

V = v_1 * v_2 * ... * v_{n-1} * v_n, 

x_W = (x_{w1}, ..., x_{wn}), 

y_W = (y_{w1}, ..., y_{wn}), 

x_V = (x_{v1}, ..., x_{vn}), 

where 

gcd(W,V) = 1; 

gcd(w_i,w_j) = 1, for i != j; 

gcd(v_i,v_j) = 1, for i != j; and 

W >= 4m, 
to obtain the output: 

xy (mod m) 
by precomputing the following constant: 

R = W^2 (mod m), 

and by precomputing the following RNS representations and constants: 

x_{wi}, y_{wi}, x_{vi}, y_{vi}, m'_{wi}, m_{vi}, 
W^{-1}_{vi} from W^{-1}_{vi} W = 1 (mod v_i), 

R_{wi}, R_{vi}, 

N_i^W from N_i^W (W/w_i) = 1 (mod w_i) for all i = (1, ..., n) 
N_i^V from N_i^V (V/v_i) = 1 (mod v_i) for all i = (1, ..., n), 
based on the following steps: 

(1) z_{wi} = x_{wi} R_{wi} (mod w_i) 
    z_{vi} = x_{vi} R_{vi} (mod v_i) 

(2) u_{wi} = z_{wi} m'_{wi} (mod w_i) 

(3) (u_{v1}, ..., u_{vn}) <== (u_{w1}, ..., u_{wn}); RNS base W conversion into RNS base V 

(4) r_{vi} = (z_{vi} + u_{vi} m_{vi}) W^{-1}_{vi} (mod v_i) 

(5) (r_{w1}, ..., r_{wn}) <== (r_{v1}, ..., r_{vn}); RNS base V conversion into RNS base W 

(6) z_{wi} = y_{wi} r_{wi} (mod w_i) 
    z_{vi} = y_{vi} r_{vi} (mod v_i) 

(7) u_{wi} = z_{wi} m'_{wi} (mod w_i) 

(8) (u_{v1}, ..., u_{vn}) <== (u_{w1}, ..., u_{wn}); RNS base W conversion into RNS base V 

(9) r_{vi} = (z_{vi} + u_{vi} m_{vi}) W^{-1}_{vi} (mod v_i) 

(10) Compute r_V (mod V) = xy (mod m). 
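
The ten steps can be exercised with the sketch below. It is a software illustration under several assumptions: all names are hypothetical, the constants m'_{wi}, m_{vi}, and W^{-1}_{vi} are recomputed on the fly instead of being precomputed, m is assumed coprime to every w_i, V is assumed large enough to hold the intermediate result, and base conversion is done by full CRT reconstruction rather than by the lookup-based method described for the hardware.

```python
from math import prod

def crt(residues, base):
    """Integer in [0, prod(base)) with the given residues."""
    B = prod(base)
    return sum(r * (B // b) * pow(B // b, -1, b)
               for r, b in zip(residues, base)) % B

def convert(residues, src, dst):
    """Base conversion by exact reconstruction (stand-in for the LUT method)."""
    val = crt(residues, src)
    return [val % b for b in dst]

def rns_mont(a_w, a_v, b_w, b_v, m, base_w, base_v):
    """One RNS Montgomery pass: residues of r = a*b*W^{-1} (mod m) in both bases."""
    W = prod(base_w)
    z_w = [(a * b) % w for a, b, w in zip(a_w, b_w, base_w)]
    z_v = [(a * b) % v for a, b, v in zip(a_v, b_v, base_v)]
    u_w = [(z * -pow(m, -1, w)) % w for z, w in zip(z_w, base_w)]   # m'_wi
    u_v = convert(u_w, base_w, base_v)
    r_v = [((z + u * (m % v)) * pow(W, -1, v)) % v
           for z, u, v in zip(z_v, u_v, base_v)]
    r_w = convert(r_v, base_v, base_w)
    return r_w, r_v

def rns_dual_pass_mult(x, y, m, base_w, base_v):
    """x*y mod m using two RNS Montgomery passes with R = W^2 (mod m)."""
    R = (prod(base_w) ** 2) % m
    res = lambda n, base: [n % b for b in base]
    r_w, r_v = rns_mont(res(x, base_w), res(x, base_v),
                        res(R, base_w), res(R, base_v), m, base_w, base_v)
    r_w, r_v = rns_mont(res(y, base_w), res(y, base_v),
                        r_w, r_v, m, base_w, base_v)
    return crt(r_v, base_v) % m     # final reduction added as a safety step
```

For example, with base W = (13, 17, 19), base V = (11, 23, 29), and m = 101, the function reproduces x*y mod 101 for operands below m.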

FIG. 8 is a flow diagram that depicts the control flow for modular 
multiplication in RNS, according to an embodiment. In FIG. 8, rectangular blocks on 
the same horizontal level overlap execution times. 

FIG. 9 is a flow diagram that depicts the data flow for modular multiplication 
in RNS, according to an embodiment. FIG. 9 includes an array 910 that has 64 17x17 
multiplier circuits, an array 920 that has 64 17-bit modular reduction circuits, a register 
file 930 that has 64 17-bit z-registers, and a register file 940 that has 64 17-bit 
u-registers. 

FIG. 10 is a block diagram that depicts a modular multiplier circuit, according 
to an embodiment. For example, the modular multiplier circuit may be used for array 
910. As shown in FIG. 10, the modular multiplier circuit includes several 34-to-17 bit 
modular reduction circuits. 

FIG. 11 is a block diagram that depicts a modular reduction circuit, according 
to an embodiment. For example, the modular reduction circuit may be used for array 
920 and as the modular reduction circuit included in the modular multiplier circuit 
depicted in FIG. 10. 

While the example shown in FIGS. 8-11 uses 17-bit numbers, 1024-bit 
operands x and y, arrays of 64 multiplier circuits and modular reduction circuits, and 
other details as shown, other implementations may differ. For example, 2048 bit 
operands x and y may be used or 512 bit operands x and y. Numbers of another bit size 
may be used, and arrays of differing numbers of multiplier circuits and modular 
reduction circuits may be used. 

C. Modular Exponentiation in RNS 
According to one embodiment, modular exponentiation in RNS is performed 
in a loop based on the inputs: 

x, y, m, W, V that are all s-bit, 
and based on the RNS representations: 
W = w_1 * w_2 * ... * w_{n-1} * w_n, 

V = v_1 * v_2 * ... * v_{n-1} * v_n, 

x_W = (x_{w1}, ..., x_{wn}), 

x_V = (x_{v1}, ..., x_{vn}), 

where 

gcd(W,V) = 1; 
gcd(w_i,w_j) = 1, for i != j; 
gcd(v_i,v_j) = 1, for i != j; and 
W >= 4m, 
to obtain the output: 

t = x^y (mod m) 
by precomputing the following constant: 
R = W^2 (mod m), 

and by precomputing the following RNS representations and constants: 

x_{wi}, x_{vi}, m'_{wi}, m_{vi}, 

W^{-1}_{vi} from W^{-1}_{vi} W = 1 (mod v_i), 

R_{wi}, R_{vi}, 

N_i^W from N_i^W (W/w_i) = 1 (mod w_i) for all i = (1, ..., n) 
N_i^V from N_i^V (V/v_i) = 1 (mod v_i) for all i = (1, ..., n) 
t_W = (1, ..., 1), t_V = (1, ..., 1) 
based on the following steps: 

(1) z_{wi} = x_{wi} R_{wi} (mod w_i) 
    z_{vi} = x_{vi} R_{vi} (mod v_i) 

(2) u_{wi} = z_{wi} m'_{wi} (mod w_i) 

(3) (u_{v1}, ..., u_{vn}) <== (u_{w1}, ..., u_{wn}); RNS base W conversion into RNS base V 

(4) r_{vi} = (z_{vi} + u_{vi} m_{vi}) W^{-1}_{vi} (mod v_i) 

(5) (r_{w1}, ..., r_{wn}) <== (r_{v1}, ..., r_{vn}); RNS base V conversion into RNS base W 

(6) for (i = 1 to s) 
(6.1)     if (y_i = 1) then 
(6.1.1)       z_{wi} = t_{wi} r_{wi} (mod w_i) 
(6.1.2)       z_{vi} = t_{vi} r_{vi} (mod v_i) 
(6.1.3)       u_{wi} = z_{wi} m'_{wi} (mod w_i) 
(6.1.4)       (u_{v1}, ..., u_{vn}) <== (u_{w1}, ..., u_{wn}); RNS conversion 
(6.1.5)       t_{vi} = (z_{vi} + u_{vi} m_{vi}) W^{-1}_{vi} (mod v_i) 
(6.1.6)       (t_{w1}, ..., t_{wn}) <== (t_{v1}, ..., t_{vn}); RNS conversion 
          endif 
(6.2)     z_{wi} = r_{wi} r_{wi} (mod w_i) 
(6.3)     z_{vi} = r_{vi} r_{vi} (mod v_i) 
(6.4)     u_{wi} = z_{wi} m'_{wi} (mod w_i) 
(6.5)     (u_{v1}, ..., u_{vn}) <== (u_{w1}, ..., u_{wn}); RNS conversion 
(6.6)     r_{vi} = (z_{vi} + u_{vi} m_{vi}) W^{-1}_{vi} (mod v_i) 
(6.7)     (r_{w1}, ..., r_{wn}) <== (r_{v1}, ..., r_{vn}); RNS conversion 
(6.8) Return 

(7) Compute t_W (mod W) = x^y (mod m). 

FIGS. 12A-12B are a flow diagram that depicts the control flow for modular 
multiplication in RNS, according to an embodiment. In FIG. 12, rectangular blocks on the 
same horizontal level overlap execution times. 

FIG. 13 is a flow diagram that depicts the data flow for modular multiplication in 
RNS, according to an embodiment. FIG. 13 includes an array 1310 that has 64 17x17 
multiplier circuits, an array 1320 that has 64 17-bit modular reduction circuits, a register file 
1330 that has 64 17-bit r1-registers, a register file 1332 that has 64 17-bit r2-registers, a 
register file 1340 that has 64 17-bit t1-registers, and a register file 1342 that has 64 17-bit 
t2-registers. 

The modular multiplier circuit depicted in FIG. 10 may be used for array 1310. 
The modular reduction circuit depicted in FIG. 11 may be used for array 1320. 

While the example shown in FIGS. 10-13 uses 17-bit numbers, 1024-bit 
operands x and y, arrays of 64 multiplier circuits and modular reduction circuits, and 
other details as shown, other implementations may differ. For example, 2048 bit 
operands x and y may be used or 512 bit operands x and y. Numbers of another bit size 
may be used, and arrays of differing numbers of multiplier circuits and modular 
reduction circuits may be used. 

D. RNS Base Extension and Conversion from One RNS Base to Another RNS Base 
Given an RNS base V: 

V = v_1 * v_2 * ... * v_{n-1} * v_n, 

and an RNS representation of an s-bit binary number r in base V: 
r_V = (r_{v1}, ..., r_{vn}), 

the goal of the conversion from the first RNS base V to a second RNS base W: 

W = w_1 * w_2 * ... * w_{n-1} * w_n 
is the RNS representation of the s-bit binary number r in base W: 
r_W = (r_{w1}, ..., r_{wn}). 

The bases W and V have the following properties: 
gcd(W,V) = 1, 

gcd(w_i,w_j) = 1 for all i != j, and 

gcd(v_i,v_j) = 1 for all i != j. 
According to the CRT, r_V can be written as: 

r_V = (c_{v1}*r_{v1} + ... + c_{vn}*r_{vn}) (mod V) 
where c_{vi} can be computed according to: 
c_{vi} = (V/v_i) N_i^V. 

The constant N_i^V is computed according to: 
N_i^V (V/v_i) = 1 (mod v_i). 

It can be proven that, for a given number of bits of r_V and V in binary 
representation, the expression: 

int[(c_{v1}*r_{v1} + ... + c_{vn}*r_{vn})/V] 

produces a binary number with an upper bound with respect to size (number of 
bits in binary representation). This upper bound is equal to the sum of the number of bits of 
the largest modulus v_i and the base-2 log of the number of moduli n in the given RNS. 

Proof: Since r_{vi} < v_i, (c_{v1}*r_{v1} + ... + c_{vn}*r_{vn}) < (c_{v1}*v_1 + ... + c_{vn}*v_n). 
From this, since c_{vi} = (V/v_i) N_i^V, (c_{v1}*v_1 + ... + c_{vn}*v_n) = V(N_1^V + ... + N_n^V). 
On the other hand, N_i^V < v_i, and V(N_1^V + ... + N_n^V) < V(v_1 + ... + v_n). From this we 
finally obtain (c_{v1}*r_{v1} + ... + c_{vn}*r_{vn})/V < (v_1 + ... + v_n). The value (v_1 + ... + v_n) 
has a bound in size (number of bits in binary representation) equal to the sum of the bits in 
the largest v_i and log2(n). 

For example, in the case of 16 bit moduli and RNS64 (n = 64 moduli), we have a 
maximum of 16 + log2(64) = 22 bits in: 

int[(c_{v1}*r_{v1} + ... + c_{vn}*r_{vn})/V]. 
The conversion of r from RNS base V to RNS base W can be expressed as: 

(r_{w1}, ..., r_{wn}) <== (r_{v1}, ..., r_{vn}). 

In order to perform the conversion of r from base V to base W, in principle, we 
need to compute: 

r_V = (c_{v1}*r_{v1} + ... + c_{vn}*r_{vn}) - V*int[(c_{v1}*r_{v1} + ... + c_{vn}*r_{vn})/V] 
and then perform a set of reductions: 
r_{wi} = r_V (mod w_i) for i = 1, ..., n. 

For conversions between two bases in each direction: 

(r_{w1}, ..., r_{wn}) <== (r_{v1}, ..., r_{vn}), and 

(r_{v1}, ..., r_{vn}) <== (r_{w1}, ..., r_{wn}), 

compute: 

r_{wj} = r_V (mod w_j) 

r_{wj} = (c_{v1}*r_{v1} + ... + c_{vn}*r_{vn}) (mod w_j) - (V*a_V) (mod w_j); j = 1, ..., n 

and 

r_{vj} = r_W (mod v_j) 

r_{vj} = (c_{w1}*r_{w1} + ... + c_{wn}*r_{wn}) (mod v_j) - (W*a_W) (mod v_j); j = 1, ..., n 

where 

a_V = int[(c_{v1}*r_{v1} + ... + c_{vn}*r_{vn})/V] 
a_W = int[(c_{w1}*r_{w1} + ... + c_{wn}*r_{wn})/W]. 
The expressions: 

d_{wj} = (V*a_V) (mod w_j) 

d_{wj} = ((V (mod w_j)) * (a_V (mod w_j))) (mod w_j), for j = 1, ..., n 

and 

d_{vj} = (W*a_W) (mod v_j) 

d_{vj} = ((W (mod v_j)) * (a_W (mod v_j))) (mod v_j), for j = 1, ..., n 
have constant multipliers: 

V (mod w_j) and 

W (mod v_j) 
that can be precomputed and stored. 

The expressions 

a_V (mod w_j) and 
a_W (mod v_j) 

require n modular reductions on the binary numbers a_V and a_W. These modular 
reductions may be done on separate hardware as described below, and in parallel with 
evaluations of the rest of the expressions for r_{wj}, r_{vj}. 

With the computed pair (d_{wj}, d_{vj}), the expressions for r_{wj}, r_{vj} take the following 
form: 

r_{wj} = (c_{v1}(mod w_j)*r_{v1} + ... + c_{vn}(mod w_j)*r_{vn}) (mod w_j) - d_{wj}; j = 1, ..., n. 
r_{vj} = (c_{w1}(mod v_j)*r_{w1} + ... + c_{wn}(mod v_j)*r_{wn}) (mod v_j) - d_{vj}; j = 1, ..., n. 
The constants e_{ij}, either: 

e_{ij} = c_{vi} (mod w_j), i = 1, ..., n; j = 1, ..., n and 
e_{ij} = c_{wi} (mod v_j), i = 1, ..., n; j = 1, ..., n; 
can also be precomputed. They form two matrices of size n x n. An example of such 
a matrix is shown in the equation below. 



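$$
\begin{pmatrix} r_{w1} \\ r_{w2} \\ \vdots \\ r_{w(n-1)} \\ r_{wn} \end{pmatrix}
=
\left[
\begin{pmatrix}
e_{11} & e_{21} & \cdots & e_{(n-1)1} & e_{n1} \\
e_{12} & e_{22} & \cdots & e_{(n-1)2} & e_{n2} \\
\vdots & \vdots & & \vdots & \vdots \\
e_{1(n-1)} & e_{2(n-1)} & \cdots & e_{(n-1)(n-1)} & e_{n(n-1)} \\
e_{1n} & e_{2n} & \cdots & e_{(n-1)n} & e_{nn}
\end{pmatrix}
\begin{pmatrix} r_{v1} \\ r_{v2} \\ \vdots \\ r_{v(n-1)} \\ r_{vn} \end{pmatrix}
\right]
\begin{matrix} \bmod\ w_1 \\ \bmod\ w_2 \\ \vdots \\ \bmod\ w_{n-1} \\ \bmod\ w_n \end{matrix}
-
\begin{pmatrix} d_{w1} \\ d_{w2} \\ \vdots \\ d_{w(n-1)} \\ d_{wn} \end{pmatrix}
$$

Row j of the matrix-vector product is reduced modulo $w_j$.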
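The role of the precomputed constants can be illustrated with a short Python sketch. It stores the matrix $e_{ij}$ and the constant multipliers V (mod $w_j$) once, then performs each conversion using only word-sized products and reductions. The names `precompute_tables` and `convert_with_tables` and the toy moduli are assumptions; the value a is computed here exactly with big integers, which stands in for the separate hardware evaluation described in this document.

```python
from math import prod

def precompute_tables(src, dst):
    """Precompute e[i][j] = c_i (mod dst_j) and k[j] = M (mod dst_j)."""
    M = prod(src)
    c = [(M // m) * pow(M // m, -1, m) for m in src]
    e = [[ci % w for w in dst] for ci in c]
    k = [M % w for w in dst]
    return c, e, k

def convert_with_tables(r_src, src, dst, tables):
    c, e, k = tables
    # a = int[(c_1*r_1 + ... + c_n*r_n)/M], evaluated exactly in this sketch.
    a = sum(ci * ri for ci, ri in zip(c, r_src)) // prod(src)
    out = []
    for j, w in enumerate(dst):
        # Row j of the matrix product, reduced modulo w_j.
        row = sum(e[i][j] * r_src[i] for i in range(len(src))) % w
        # d_j = ((M mod w_j) * (a mod w_j)) mod w_j
        d = (k[j] * (a % w)) % w
        out.append((row - d) % w)
    return out

# Toy bases (hypothetical values).
V = [13, 17, 19]
W = [23, 29, 31]
tables_vw = precompute_tables(V, W)
r = 3000
r_w = convert_with_tables([r % v for v in V], V, W, tables_vw)
```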






The method described above, with RNS64, requires 4096 unpartitioned LUTs
for the RNS V --> RNS W conversion and another 4096 unpartitioned LUTs of the same
size for the RNS W --> RNS V conversion. The LUTs can be made smaller by partitioning
the address space into 4 (4 bit and 5 bit) spaces and then performing 4 partial product
additions. For this case the LUTs are of size 16x17 and 32x17 bits and would grow in
number to 4x4096 = 16K for a one-direction conversion. The total size of LUTs is
(16K + 16K)*32 Bytes = 512KB + 512KB = 1 MByte of ROM.
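The partitioned lookup can be sketched as follows. A 17 bit operand is split into three 4 bit fields and one 5 bit field; each field addresses its own table of 17 bit entries, and the four partial products are added and reduced. The function names, the ordering of the fields, and the constant `e` are assumptions made for illustration only.

```python
FIELD_WIDTHS = (4, 4, 4, 5)   # assumed split of a 17 bit operand (low bits first)

def build_luts(e, w, widths=FIELD_WIDTHS):
    """Build one table per field holding (field_value * 2**shift * e) mod w."""
    luts, shift = [], 0
    for width in widths:
        luts.append([((x << shift) * e) % w for x in range(1 << width)])
        shift += width
    return luts

def lut_multiply(r, luts, w, widths=FIELD_WIDTHS):
    """Compute (e * r) mod w from four table lookups and partial sums."""
    total = 0
    for lut, width in zip(luts, widths):
        total += lut[r & ((1 << width) - 1)]   # partial product for this field
        r >>= width
    return total % w

w = 130003    # a 17 bit modulus
e = 98765     # hypothetical precomputed constant, e < w
luts = build_luts(e, w)
```

With this split there are three tables of 16 entries and one of 32 entries per constant, which matches the 16x17 and 32x17 sizes quoted above.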

FIG. 14 is a diagram of a single row architecture for RNS base conversion,
according to an embodiment. The structure shown in FIG. 14 follows from the matrix
equation above.

FIGS. 15 and 16 are diagrams that depict examples of counter/compressor 
structures. 


FIG. 19 is a diagram of an RNS conversion pipeline that provides input to an
array of 4:2 compressors, according to an embodiment.

FIG. 20 is a diagram of an array of 4:2 compressors organized into six levels,
according to an embodiment.

FIG. 21 is a diagram of eight planes of pipelined hardware for simultaneous
execution, according to an embodiment.

FIG. 22 is a diagram of vector $d_{w/v}$ computations, according to an embodiment.

Gate count estimates for the RNS V64 <=> RNS W64 conversions are determined as
follows:

~512KB + ~512KB of LUTs (ROM), or 1M of 4-to-17 decoders, whichever is smaller.

6 levels of 4:2 compressors x 17 bit positions = 256 x 17 FAs = ~20K gates.

Final Add + Modulo Reduction: ~1K gates.

The $d_{wj}$ and $r_{wj} - d_{wj}$ calculations: ~1K gates.

Total gate count for this module: 22K x 8 = 176K gates.

XII. MODULAR REDUCTION FOR RNS CONVERSIONS 
According to one embodiment, the expected operand size, out of the array of
compressors, is 24 bits. However, the approach described herein is valid for an
arbitrary number of bits, and can be used for 34 bit reductions in the main
exponentiation loop.

The partitioning of an operand, r, may be expressed as:

$$r \pmod{m} = (A + B\,2^{16} + C\,2^{17} + D\,2^{22}) \pmod{m}$$

where m is the modulus that satisfies

$$2^{16} < m < 2^{17}$$

and where A is a 16 bit value, B is a single bit value, C is a 5 bit value, and D is a 2
bit value or a 12 bit value.

The modulus m can be written as:

$$m = 2^{17} - g$$

where g can take on any value between 1 and $2^{16}$. For example, assume that g
is selected to be an 11 bit number:

$$1 < g < 2^{11}$$

Since:

$$2^{17} = m + g$$

then:

$$r \pmod{m} = (A + B\,2^{16} + C(m + g) + D\,2^{22}) \pmod{m}$$

$$r \pmod{m} = (A + B\,2^{16} + Cg + D\,2^{22}) \pmod{m}$$

$$r \pmod{m} = (A + Cg + B\,2^{16} + D\,2^{22}) \pmod{m}.$$

Since it can be shown that

$$A + Cg < m$$

the above expression reduces to evaluating

$$(B\,2^{16} + D\,2^{22}) \pmod{m} = \bigl(B\,2^{16} + D\,(2^{22} \pmod{m})\bigr) \pmod{m}.$$

A simple combinatorial circuit can be used to evaluate this expression. The
expression A + Cg requires a 5x11 bit multiplier and a 16 bit adder.
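The partitioned reduction can be checked with a small Python sketch. The function name `reduce_mod` and its parameters are assumptions; the general `%` operator on the B and D terms stands in for the combinatorial circuit mentioned above, and the final conditional subtraction is an illustrative choice rather than a detail taken from this document.

```python
def reduce_mod(r, g, d_bits=2):
    """Reduce r modulo m = 2**17 - g using the A/B/C/D partition."""
    assert 1 < g < (1 << 11)
    assert 0 <= r < (1 << (22 + d_bits))
    m = (1 << 17) - g
    A = r & 0xFFFF            # bits 0..15
    B = (r >> 16) & 0x1       # bit 16
    C = (r >> 17) & 0x1F      # bits 17..21
    D = r >> 22               # remaining 2 (or 12) bits
    low = A + C * g           # C*2**17 is congruent to C*g, and A + C*g < m
    # Stand-in for the combinatorial circuit: (B*2**16 + D*(2**22 mod m)) mod m
    high = ((B << 16) + D * ((1 << 22) % m)) % m
    total = low + high        # both terms are below m, so total < 2*m
    return total - m if total >= m else total
```

For example, `reduce_mod(r, 9)` reduces a 24 bit operand modulo 131063, and `reduce_mod(r, 9, d_bits=12)` handles a 34 bit operand with the same structure.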

XII. MODULAR REDUCTION FOR MODULAR EXPONENTIATION LOOP 
According to one embodiment, the modular reductions (mod $w_j$) and (mod $v_j$)
are performed on the structure described above, except that D is of size 12
bits.

The expected operand size, out of the array of multipliers, is 34 bits. As above,
the operand z can be partitioned according to FIG. 18 as:

$$z \pmod{m} = (A + B\,2^{16} + C\,2^{17} + D\,2^{22}) \pmod{m}$$

where m is the modulus that satisfies

$$2^{16} < m < 2^{17}$$

and where A is a 16 bit value, B is a single bit value, C is a 5 bit value, and D is
a 12 bit value.

As above, modulus m can be written as:

$$m = 2^{17} - g$$

where g can take on any value between 1 and $2^{16}$. For example, assume that g
is selected to be an 11 bit number:

$$1 < g < 2^{11}$$

Since:

$$2^{17} = m + g$$

then:

$$z \pmod{m} = (A + Cg + B\,2^{16} + D\,2^{22}) \pmod{m}.$$

Since it can be shown that A + Cg < m, the above expression reduces to
evaluating:

$$(B\,2^{16} + D\,2^{22}) \pmod{m} = \bigl(B\,2^{16} + D\,(2^{22} \pmod{m})\bigr) \pmod{m}.$$

Again, a combinatorial circuit can be used to evaluate this expression. The
expression A + Cg requires a 5x11 bit multiplier and a 16 bit adder.
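The bound A + Cg < m is stated without proof. Under the assumptions given above (A of 16 bits, C of 5 bits, g below $2^{11}$), one way to verify it is to compare the largest possible left-hand side with the smallest possible modulus:

$$
\begin{aligned}
A + Cg &\le (2^{16} - 1) + (2^{5} - 1)(2^{11} - 1) = 65535 + 63457 = 128992, \\
m = 2^{17} - g &\ge 2^{17} - (2^{11} - 1) = 129025, \\
\text{so}\quad A + Cg &\le 128992 < 129025 \le m .
\end{aligned}
$$

The same assumptions also give a closed form for the constant in the remaining term: $2^{22} = 2^{5} \cdot 2^{17} \equiv 2^{5} g \pmod{m}$, and since $2^{5} g \le 32 \cdot 2047 = 65504 < m$, it follows that $2^{22} \pmod{m} = 32g$.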

XIII. FINAL BINARY RESULT COMPUTATION 
The final binary result computation is the final step in the modular 
exponentiation approach (i.e., the RNS representation to binary representation 
conversion). 

Given an RNS representation of an s-bit binary number t:

$$t_w = (t_{w1}, \ldots, t_{wn})$$

in base W:

$$W = w_1 w_2 \cdots w_{n-1} w_n,$$

the goal is to compute the binary representation of $t_w$. According to the CRT, $t_w$
can be written as:

$$t_w = (c_{w1} t_{w1} + \cdots + c_{wn} t_{wn}) \pmod{W},$$

where $c_{wi}$ can be computed according to:

$$c_{wi} = (W/w_i)\, N_{iW}.$$

The $N_{iW}$ is computed according to:

$$N_{iW}\,(W/w_i) \equiv 1 \pmod{w_i}.$$

To obtain the reduction (mod W), compute:

$$t_w = (c_{w1} t_{w1} + \cdots + c_{wn} t_{wn}) - W \cdot \operatorname{int}\!\left[(c_{w1} t_{w1} + \cdots + c_{wn} t_{wn})/W\right].$$

For RNS64, the expression

$$a = \operatorname{int}\!\left[(c_{w1} t_{w1} + \cdots + c_{wn} t_{wn})/W\right]$$

is computed from:

$$a = \operatorname{int}\!\left[(N_{1W}/w_1)\, t_{w1} + \cdots + (N_{nW}/w_n)\, t_{wn}\right].$$

This computation may be performed on an array of integer/fractional
multiply-add structures (with constant multiplicand) described before in the context of
RNS V <=> RNS W conversions. The existing structure can be re-used.

The expression:

$$W \cdot a$$

is evaluated on a constant-multiplicand multiplier of size 1024x22, again in a
mixed form structure of lookup tables and partial product adders.

The expression:

$$b = (c_{w1} t_{w1} + \cdots + c_{w64} t_{w64})$$

is evaluated in parallel/series on a structure that represents a combination of
lookup tables and a constant-multiplicand multiplier of size 1024x17. This expression can
also be evaluated, in microcode, on the RMC core.

The three operations for computing a, (W*a), and b are performed over
multiple clock cycles. Since this RNS representation to binary representation conversion
is done only once, at the end of the modular exponentiation computation, the use of
multiple clock cycles does not affect the overall throughput.

As before, on the infrequent occasions when the evaluation of a fails, a hardware
module for precise, direct evaluation of b/W is engaged.

FIG. 18 is a diagram that depicts an implementation of RNS to binary
conversion, according to an embodiment. FIG. 18 depicts the implementation of the $c_{wi} t_{wi}$
multiplication with a combination of LUTs and an array of 4:2 compressors. The final
adder and the accumulator can be made as fast as necessary, and the total gate count
depends on such implementation. One estimate is that ~30K gates are used. The
accumulation is performed 64 times in a loop. Total ROM memory required is 512KB.
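The three quantities a, W*a, and b can be tied together in a short Python sketch. Here a is estimated from truncated fixed-point constants $N_{iW}/w_i$, and an exact division is used as a fallback when the estimate lies too close to an integer boundary. The function name, the `frac_bits` parameter, and the specific closeness test are assumptions for illustration; the document does not specify how a failed evaluation of a is detected.

```python
from math import prod

def rns_to_binary(t, moduli, frac_bits=32):
    """Recover the integer t from its residues using b - W*a."""
    W = prod(moduli)
    N = [pow(W // w, -1, w) for w in moduli]        # N_i * (W/w_i) = 1 (mod w_i)
    # b = c_1*t_1 + ... + c_n*t_n with c_i = (W/w_i) * N_i
    b = sum((W // w) * n * ti for w, n, ti in zip(moduli, N, t))
    # Fixed-point estimate of sum((N_i/w_i) * t_i) with truncated constants.
    scale = 1 << frac_bits
    approx = sum(((n * scale) // w) * ti for w, n, ti in zip(moduli, N, t))
    a = approx >> frac_bits
    # Truncation lowers the scaled sum by less than sum(t); if that error
    # could carry into the integer part, fall back to the exact quotient b/W.
    max_err = sum(t)
    if (approx + max_err) >> frac_bits != a:
        a = b // W
    return b - W * a

# Toy base (hypothetical values).
moduli = [13, 17, 19, 23]
value = 54321
residues = [value % w for w in moduli]
recovered = rns_to_binary(residues, moduli)
```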

A hardware size estimate for the RNS64-to-binary conversion module is as
follows:

"a" value calculation unit: the existing module reused;

(W*a) value calculation unit --> 5 LUTs + reused adders from the above
circuit - b value calculation;

"b" value calculation unit --> 512KB LUTs + ~30K gates; and

estimated gate count for the correction unit --> 20K gates.

The total equivalent NAND-gate count estimate for the RNS64-to-binary
conversion is ~50K gates and ~524KB LUTs (or equivalent decoders).

XIV. ADDITIONAL ARCHITECTURAL CONSIDERATIONS 
The 64 moduli $w_i$ and $v_i$ (not necessarily prime) are selected as follows:

$$W = w_1 w_2 \cdots w_{63} w_{64}, \text{ and}$$

$$V = v_1 v_2 \cdots v_{63} v_{64}$$

such that

$$\gcd(W, V) = 1,$$

$$\gcd(w_i, w_j) = 1 \text{ for all } i \ne j, \text{ and}$$

$$\gcd(v_i, v_j) = 1 \text{ for all } i \ne j.$$

One example of the set of $w_i$ and $v_i$ in the range

$$2^{16} < w_i, v_i < 2^{17}$$

is given in Table 1, wherein the first 64 values are for one base and the second
64 values are for the second base.





Table 1

129527  129529  129533  129539  129553  129581  129587  129589
129593  129607  129629  129631  129641  129643  129671  129707
129719  129733  129737  129749  129757  129763  129769  129793
129803  129841  129853  129887  129893  129901  129917  129919
129937  129953  129959  129967  129971  130003  130021  130027
130043  130051  130057  130069  130073  130079  130087  130099
130121  130127  130147  130171  130183  130199  130201  130211
130223  130241  130253  130259  130261  130267  130279  130303
130307  130337  130343  130349  130363  130367  130369  130379
130399  130409  130411  130423  130439  130447  130457  130469
130477  130483  130489  130513  130517  130523  130531  130547
130553  130579  130589  130619  130621  130631  130633  130639
130643  130649  130651  130657  130681  130687  130693  130699
130729  130769  130783  130787  130807  130811  130817  130829
130841  130843  130859  130873  130927  130957  130969  130973
130981  130987  131009  131011  131023  131041  131059  131063
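The selection criteria for the two bases can be checked mechanically. The following Python sketch tests a candidate pair of bases for the 17 bit range, for g = $2^{17}$ - m fitting in 11 bits (the assumption used in the modular reduction sections), and for pairwise coprimality across both bases, which implies gcd(W, V) = 1. The function name `check_bases` and the choice of sample entries are assumptions; only a few values from Table 1 are used to keep the example short.

```python
from math import gcd
from itertools import combinations

def check_bases(w, v, bits=17, g_bits=11):
    """Return (in_range, small_g, coprime) for the candidate bases w and v."""
    lo, hi = 1 << (bits - 1), 1 << bits
    moduli = list(w) + list(v)
    in_range = all(lo < m < hi for m in moduli)
    small_g = all(hi - m < (1 << g_bits) for m in moduli)
    # Pairwise coprimality over the union covers gcd(w_i, w_j), gcd(v_i, v_j)
    # and gcd(W, V) in a single pass.
    coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
    return in_range, small_g, coprime

w_sample = [129527, 129529, 129533, 129539]   # first entries of the first base
v_sample = [130307, 130337, 130343, 130349]   # first entries of the second base
result = check_bases(w_sample, v_sample)
```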







Assume that the modulus m is a large number of 2048 bits in radix 2 weighted
number representation. In encryption protocol computations such as RSA, the modulus
m is a product of two large primes, p and q: m = pq. According to the CRT, all
modulo computations can be performed separately relative to p and q (for the selected
maximum key size of 2048 bits, p and q are both 1024 bit, radix 2, numbers), and then the
final result is recombined. As a result, the modular arithmetic is built around two
identical 1024 bit ALUs. Thus the RNS decompositions are performed on 1024 bit
numbers.
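The split into two half-size computations can be illustrated with a textbook-sized RSA example in Python. The recombination step shown here is the common Garner form, which is an assumption; this document does not state which recombination formula is used. The function name and the tiny key (p = 61, q = 53) are for illustration only.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Decrypt c modulo m = p*q using two half-size exponentiations."""
    dp, dq = d % (p - 1), d % (q - 1)    # reduced exponents for each prime
    mp = pow(c % p, dp, p)               # computation relative to p
    mq = pow(c % q, dq, q)               # computation relative to q
    q_inv = pow(q, -1, p)
    h = (q_inv * (mp - mq)) % p          # Garner recombination
    return mq + h * q

p, q, e, d = 61, 53, 17, 2753            # toy key: e*d = 1 (mod lcm(p-1, q-1))
message = 65
ciphertext = pow(message, e, p * q)
plaintext = rsa_crt_decrypt(ciphertext, p, q, d)
```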

According to one embodiment, RNS multiplications are performed in parallel
on 17 bit operands for all non-constant multiplications. Two independent arrays of
multipliers, for 17x17 bit integer multiplication and 17x32 bit (constant multiplicand)
fractional multiplication, form the multiplier complex with an estimated gate count of 300K
gates.

In addition to the multiplier array for RNS conversion, an array of 63 18-bit
adders is created that produces the array multiplier final sum. A similar configuration of
adders is formed for the array of fractional multipliers. The total allocation of gates for the
two adder arrays is ~20K gates.

The modular reduction of 34 bit operands is handled on the array of 64
modular reduction circuits as presented above.

The exceptional case, when the result of fractional operations is too close to an
integer, is handled either by the RMC processor core or by specialized hardware that
computes the precise quotient. The additional hardware unit for this function can be based on
the exact modulus determination of CRT recombined numbers. It operates on a
different time schedule (considerably slower than the regular RNS conversion).

The estimate of the total gate count for a single 1024 bit ALU is ~450K gates
of random logic. This number needs to be doubled, to ~900K gates (including the
exception handling hardware), for two ALUs running in parallel to produce 2048 bit
results.

Based on the above description of this particular implementation, the following
provides a time estimate for the most critical operations in an RSA decryption
implementation using modular exponentiation on 2048 bit operands (both base and
exponent), according to an embodiment.

For timing estimate purposes, consider the approach above for modular
exponentiation with particular architectural and design data.

Since m = pq, for p, q prime, each 1024 bit in radix 2 representation, modular
exponentiation is performed based on either p or q by using 1024 bit arithmetic.
The x, y in radix 2 representation are:

$$x = (x_1, \ldots, x_{1024}),$$

$$y = (y_1, \ldots, y_{1024}).$$

Assume that the operand x is represented in two RNS bases, W and V, with 64
residues, RNS64. Therefore, the two RNS representations of x are:

$$x_w = (x_{w1}, \ldots, x_{w64}), \text{ and}$$

$$x_v = (x_{v1}, \ldots, x_{v64}).$$

In RNS64 all precomputed constants are 17 bit. Moduli $w_i$ and $v_i$ are
determined to be 17 bit numbers.

The timing analysis of one embodiment, including time overlap, is as follows:

3 clks                 (1)  z_wi = x_wi R_wi (mod w_i); z_vi = x_vi R_vi (mod v_i)
3 clks                 (2)  u_wi = z_wi m'_wi (mod w_i)
10 clks                (3)  (u_v1, ..., u_vn) <== (u_w1, ..., u_wn)
4 clks                 (4)  r_vi = (z_vi + u_vi m_vi) W^-1_vi (mod v_i)
10 clks                (5)  (r_w1, ..., r_wn) <== (r_v1, ..., r_vn)
                       (6)  for (s = 1 to 1024) {
                       (7)    if (y_s = 1) then {
overlap (10)           (7.1)    z_wi = t_wi r_wi (mod w_i); z_vi = t_vi r_vi (mod v_i)
3 clks                 (7.2)    u_wi = z_wi m'_wi (mod w_i)
6 clks, overlap (11)   (7.3)    (u_v1, ..., u_vn) <== (u_w1, ..., u_wn)
overlap (12)           (7.4)    t_vi = (z_vi + u_vi m_vi) W^-1_vi (mod v_i)
10 clks                (7.5)    (t_w1, ..., t_wn) <== (t_v1, ..., t_vn)
                              }
3 clks                 (8)    z_wi = r_wi r_wi (mod w_i); z_vi = r_vi r_vi (mod v_i)
3 clks                 (9)    u_wi = z_wi m'_wi (mod w_i)
10 clks                (10)   (u_v1, ..., u_vn) <== (u_w1, ..., u_wn)
4 clks                 (11)   r_vi = (z_vi + u_vi m_vi) W^-1_vi (mod v_i)
10 clks                (12)   (r_w1, ..., r_wn) <== (r_v1, ..., r_vn)
                            }
6 clks                 (13) Compute t_w (mod W).
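Each multiply or square in the listing follows the same five-step pattern: a residue-wise product, a multiplication by m' in base W, a conversion to base V, a correction and scaling in base V, and a conversion back to base W. A minimal Python sketch of one such step is given below. It assumes this is the usual RNS form of Montgomery multiplication, uses exact big-integer base conversion as a stand-in for the table-driven conversion, and requires the modulus N to be coprime to both bases with N < W and 2N < V. All function names and the toy values are assumptions.

```python
from math import prod

def to_rns(x, base):
    return [x % b for b in base]

def from_rns(res, base):
    M = prod(base)
    return sum((M // b) * pow(M // b, -1, b) * r for b, r in zip(base, res)) % M

def extend(res, src, dst):
    """Exact base conversion (stand-in for the table-driven conversion)."""
    return to_rns(from_rns(res, src), dst)

def rns_montmul(x_w, x_v, y_w, y_v, N, Wb, Vb):
    """Return the residues of x*y*W^-1 (mod N), possibly plus N, in both bases."""
    W = prod(Wb)
    # Residue-wise products in both bases, as in steps (1), (7.1), (8).
    z_w = [(a * b) % w for a, b, w in zip(x_w, y_w, Wb)]
    z_v = [(a * b) % v for a, b, v in zip(x_v, y_v, Vb)]
    # u = z * m' (mod w_i) with m' = -N^-1 (mod w_i), as in steps (2), (7.2), (9).
    u_w = [(z * -pow(N, -1, w)) % w for z, w in zip(z_w, Wb)]
    # Convert u from base W to base V, as in steps (3), (7.3), (10).
    u_v = extend(u_w, Wb, Vb)
    # r = (z + u*N) * W^-1 (mod v_i), as in steps (4), (7.4), (11).
    r_v = [((z + u * (N % v)) * pow(W, -1, v)) % v
           for z, u, v in zip(z_v, u_v, Vb)]
    # Convert r from base V back to base W, as in steps (5), (7.5), (12).
    r_w = extend(r_v, Vb, Wb)
    return r_w, r_v

# Toy parameters (hypothetical): N = 43*47, W = 4199 > N, V = 20677 > 2N.
Wb, Vb, N = [13, 17, 19], [23, 29, 31], 2021
x, y = 1234, 567
r_w, r_v = rns_montmul(to_rns(x, Wb), to_rns(x, Vb),
                       to_rns(y, Wb), to_rns(y, Vb), N, Wb, Vb)
```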



The total number of cycles, assuming that on average 50% of the bits in the radix 2
representation of the exponent are 0, is:

30 + 19*512 + 30*1024 + 6 ~ 40,000 cycles.
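The cycle budget can be tallied with a few lines of Python. The per-step costs are read from the timing listing (30 cycles of setup, 19 per conditional multiply, 30 per squaring, 6 for the final step); the function name and the clock frequency parameter are assumptions for illustration, with 200 MHz being the clock used in this estimate.

```python
def exponentiation_cycles(exp_bits=1024, setup=30, multiply=19, square=30, final=6):
    """Total cycles, assuming half of the exponent bits are 1."""
    ones = exp_bits // 2
    return setup + multiply * ones + square * exp_bits + final

cycles = exponentiation_cycles()                    # 40484, about 40,000
clock_hz = 200_000_000                              # assumed 200 MHz clock
time_ns = cycles * 1_000_000_000 // clock_hz        # 5 ns per cycle
per_second = clock_hz // cycles                     # exponentiations per second
```

The exact tally is slightly above the rounded 40,000 figure, so the resulting rate is slightly below the rounded 5,000 per second.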

For a 200MHz clock, the time is computed as:

40,000*5 = 200,000ns = 0.2ms

or ~5,000 RSA calculations per second, according to one embodiment.

XV. HARDWARE OVERVIEW
FIG. 23 is a block diagram that depicts a computer system 2300 upon which an 
embodiment of the invention may be implemented. Computer system 2300 includes a bus 
2302 or other communication mechanism for communicating information, and a processor 
2304 coupled with bus 2302 for processing information. Computer system 2300 also includes 
a main memory 2306, such as a random access memory (RAM) or other dynamic storage 
device, coupled to bus 2302 for storing information and instructions to be executed by 
processor 2304. Main memory 2306 also may be used for storing temporary variables or other 
intermediate information during execution of instructions to be executed by processor 2304. 
Computer system 2300 further includes a read only memory (ROM) 2308 or other static 
storage device coupled to bus 2302 for storing static information and instructions for processor 
2304. A storage device 2310, such as a magnetic disk or optical disk, is provided and coupled 
to bus 2302 for storing information and instructions. 

Computer system 2300 may be coupled via bus 2302 to a display 2312, such as a 
cathode ray tube (CRT), for displaying information to a computer user. An input device 2314, 
including alphanumeric and other keys, is coupled to bus 2302 for communicating information 
and command selections to processor 2304. Another type of user input device is cursor control 
2316, such as a mouse, a trackball, or cursor direction keys for communicating direction 
information and command selections to processor 2304 and for controlling cursor movement 
on display 2312. This input device typically has two degrees of freedom in two axes, a first 
axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane. 

The invention is related to the use of computer system 2300 for using
pre-computation and dual-pass modular operations to implement encryption protocols efficiently
in electronic hardware. According to one embodiment of the invention, using
pre-computation and dual-pass modular operations to implement encryption protocols
efficiently in electronic hardware is provided by computer system 2300 in response to
processor 2304 executing one or more sequences of one or more instructions contained in 
main memory 2306. Such instructions may be read into main memory 2306 from another 
computer-readable medium, such as storage device 2310. Execution of the sequences of 
instructions contained in main memory 2306 causes processor 2304 to perform the process 
steps described herein. In alternative embodiments, hard-wired circuitry may be used in 
place of or in combination with software instructions to implement the invention. Thus, 
embodiments of the invention are not limited to any specific combination of hardware 
circuitry and software. 

The term "computer-readable medium" as used herein refers to any medium that 
participates in providing instructions to processor 2304 for execution. Such a medium may 
take many forms, including but not limited to, non-volatile media, volatile media, and 
transmission media. Non-volatile media includes, for example, optical or magnetic disks, such 
as storage device 2310. Volatile media includes dynamic memory, such as main memory 
2306. Transmission media includes coaxial cables, copper wire and fiber optics, including the 
wires that comprise bus 2302. Transmission media can also take the form of acoustic or light 
waves, such as those generated during radio-wave and infra-red data communications. 

Common forms of computer-readable media include, for example, a floppy disk, a 
flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other 
optical medium, punchcards, papertape, any other physical medium with patterns of holes, a 
RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a
carrier wave as described hereinafter, or any other medium from which a computer can read. 

Various forms of computer readable media may be involved in carrying one or more 
sequences of one or more instructions to processor 2304 for execution. For example, the 
instructions may initially be carried on a magnetic disk of a remote computer. The remote 
computer can load the instructions into its dynamic memory and send the instructions over a
telephone line using a modem. A modem local to computer system 2300 can receive the data
on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. 
An infra-red detector can receive the data carried in the infra-red signal and appropriate 
circuitry can place the data on bus 2302. Bus 2302 carries the data to main memory 2306, 
from which processor 2304 retrieves and executes the instructions. The instructions received 
by main memory 2306 may optionally be stored on storage device 2310 either before or after 
execution by processor 2304. 

Computer system 2300 also includes a communication interface 2318 coupled to bus 
2302. Communication interface 2318 provides a two-way data communication coupling to a 
network link 2320 that is connected to a local network 2322. For example, communication 
interface 2318 may be an integrated services digital network (ISDN) card or a modem to 
provide a data communication connection to a corresponding type of telephone line. As 
another example, communication interface 2318 may be a local area network (LAN) card to 
provide a data communication connection to a compatible LAN. Wireless links may also be 
implemented. In any such implementation, communication interface 2318 sends and 
receives electrical, electromagnetic or optical signals that carry digital data streams 
representing various types of information. 

Network link 2320 typically provides data communication through one or more 
networks to other data devices. For example, network link 2320 may provide a connection 
through local network 2322 to a host computer 2324 or to data equipment operated by an 
Internet Service Provider (ISP) 2326. ISP 2326 in turn provides data communication 
services through the world wide packet data communication network now commonly 
referred to as the "Internet" 2328. Local network 2322 and Internet 2328 both use electrical, 
electromagnetic or optical signals that carry digital data streams. The signals through the 
various networks and the signals on network link 2320 and through communication interface
2318, which carry the digital data to and from computer system 2300, are exemplary forms
of carrier waves transporting the information. 

Computer system 2300 can send messages and receive data, including program code, 
through the network(s), network link 2320 and communication interface 2318. In the 
Internet example, a server 2330 might transmit a requested code for an application program 
through Internet 2328, ISP 2326, local network 2322 and communication interface 2318. 

The received code may be executed by processor 2304 as it is received, and/or stored 
in storage device 2310, or other non-volatile storage for later execution. In this manner, 
computer system 2300 may obtain application code in the form of a carrier wave. 

XVI. EXTENSIONS AND ALTERNATIVES 
In the foregoing specification, the invention has been described with reference to 
specific embodiments thereof. It will, however, be evident that various modifications and 
changes may be made thereto without departing from the broader spirit and scope of the 
invention. The specification and drawings are, accordingly, to be regarded in an illustrative 
rather than a restrictive sense.